- From: Dongwook Shin <dwshin@nlm.nih.gov>
- To: xml-dev@xml.org
- Date: Fri, 31 Mar 2000 08:56:46 -0500
There seem to be some errors in the moderation of this mailing list.
I posted a couple of messages recently, but they never appeared
on the list. So I am posting one of them again.
Dongwook
--------------------------------------------------------
Hi, gopi:
I am very happy to see you raise the need for an API for XML
storage. I disagree with the argument that everything can be done
through the DOM interface. That is like claiming that any application
can be run entirely in memory.
I have been developing XML indexing and retrieval engines that
scale up to large XML collections. I have also seen that similar
engines, which only build a DOM and search within it, fail to scale
to large collections.
Every time I develop an XML IR system, I need an API for XML storage.
The only time I want to invoke DOM is during indexing, which is usually
performed off-line. During retrieval I do not want to rely on DOM,
since it can consume a huge amount of memory, which seems to be a major
factor in degrading retrieval performance. Instead, I want to use a
lightweight index that maps elements to the actual data. To create such
an index without depending on a specific repository, it seems important
to have a well-defined API for XML storage.
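To make the idea of a lightweight index concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: the names ElementIndex, add and lookup are hypothetical, the "repository" is a plain string, and the off-line indexing pass is replaced by a naive text scan rather than a real parser. The point is only that retrieval slices the stored data by offset and builds no DOM.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lightweight index: element path -> character spans in the stored text. */
class ElementIndex {
    private final Map<String, List<int[]>> spans = new HashMap<>();

    /** Record that an element at `path` occupies [start, end) in the stored data. */
    void add(String path, int start, int end) {
        spans.computeIfAbsent(path, k -> new ArrayList<>()).add(new int[] {start, end});
    }

    /** Return the raw text of every element at `path`; no DOM is built at query time. */
    List<String> lookup(String path, String stored) {
        List<String> hits = new ArrayList<>();
        for (int[] s : spans.getOrDefault(path, Collections.emptyList())) {
            hits.add(stored.substring(s[0], s[1]));
        }
        return hits;
    }
}

class ElementIndexDemo {
    // Stand-in for a repository; a real one would sit behind a storage API.
    static final String STORED =
        "<root><Book><title>A</title></Book><Book><title>B</title></Book></root>";

    /** Stand-in for the off-line indexing pass: a naive scan for <title> elements. */
    static ElementIndex build() {
        ElementIndex index = new ElementIndex();
        String open = "<title>";
        String close = "</title>";
        int from = 0;
        while (true) {
            int start = STORED.indexOf(open, from);
            if (start < 0) break;
            int end = STORED.indexOf(close, start) + close.length();
            index.add("root/Book/title", start, end);
            from = end;
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(build().lookup("root/Book/title", STORED));
    }
}
```

With a standard storage API, the offsets above would be whatever address the repository hands back, so the same index code could work against different repositories.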
Dongwook
gopi wrote:
>Hi Gopinath!
>
>Almost all XML storage products, such as XDBM or Ozone-db, now have a
>DOM interface for communicating with the storage, or at least try to
>keep their API close to the DOM one. I know there is nothing in the DOM
>spec about retrieving and storing Documents (only Nodes). But in most
>cases I have encountered, one can use the XML database to model
>filesystem folders and store Documents as subtrees of folder nodes.
>This lacks the features you get from a per-Document DTD, but it
>provides something similar to namespaces, i.e. you may consider the set
>of documents as one big nested document.
>
>As for optimal storage, existing XML engines store the document in
>accordance with the DOM-inspired model, i.e. linked Nodes. I think
>there is nothing to standardize.
>
>The high-level applications such as XPath or XQL are built on top of
>DOM (IMHO), so they work fine if you provide them with any
>DOM-compliant interface.
I think this is where the problem lies. If an XQL query asks for all the
nodes where "root/Book/@price > 35", do you want the XPath or XQL
processor to search the entire XML document using DOM APIs? Certainly
not. You will definitely expect some kind of "indexing" and "caching" to
improve performance. To do the indexing, the index needs some
information from the storage engine (such as the physical address, in
the case of persistent storage). If the path "root/Book/@price" is
indexed, then when the XML document is stored the storage engine would
return the physical address of each "root/Book" node along with its
@price value. This helps in "minimizing tree traversal using DOM API
calls". So we expect some standard way of getting this information (the
physical address) when storing a DOM node. When the XML document is
updated through some XML query language, which in turn calls DOM APIs,
the "implementation" of the DOM APIs should update the XML storage
engine and pass the returned value on to the "indexing" part.
If the query is something like get all "root/Book/title" where
"root/Book/@price > 35", then the XQL processor can first query the XIC
(as in the figure) to get the DOM node for "root/Book" and then use the
DOM API Node.getFirstChild(), which returns the <title> element.
Wouldn't this be efficient? There would be minimal disk access (that
is, minimal storage engine API calls to retrieve the DOM node).
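The query flow described above can be sketched roughly as follows. This is a toy model, not an existing API: PriceIndex plays the role of the XIC, ToyStorageEngine plays the role of the XSE, and all class and method names are hypothetical. To stay short, the engine stores only each book's title string instead of a DOM node, so the Node.getFirstChild() step is collapsed into titleAt().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical XIC: maps indexed @price values to storage addresses of root/Book nodes. */
class PriceIndex {
    private final TreeMap<Double, List<Long>> byPrice = new TreeMap<>();

    /** Called at store time with the address the storage engine returned. */
    void onStored(double price, long address) {
        byPrice.computeIfAbsent(price, k -> new ArrayList<>()).add(address);
    }

    /** Addresses of all books whose price is strictly greater than `min`. */
    List<Long> greaterThan(double min) {
        List<Long> out = new ArrayList<>();
        for (List<Long> addrs : byPrice.tailMap(min, false).values()) {
            out.addAll(addrs);
        }
        return out;
    }
}

/** Hypothetical XSE: stores one title per book and hands back an address for it. */
class ToyStorageEngine {
    private final Map<Long, String> titles = new HashMap<>();
    private long next = 0;
    int reads = 0;  // counts storage accesses made while answering queries

    long store(String title) {
        titles.put(next, title);
        return next++;
    }

    String titleAt(long address) {
        reads++;
        return titles.get(address);
    }
}

class PriceQueryDemo {
    /** Titles of books where root/Book/@price > min, reading only the matching nodes. */
    static List<String> titlesAbove(double min, PriceIndex xic, ToyStorageEngine xse) {
        List<String> result = new ArrayList<>();
        for (long addr : xic.greaterThan(min)) {
            result.add(xse.titleAt(addr));
        }
        return result;
    }

    public static void main(String[] args) {
        PriceIndex xic = new PriceIndex();
        ToyStorageEngine xse = new ToyStorageEngine();
        xic.onStored(20.0, xse.store("Cheap Book"));
        xic.onStored(40.0, xse.store("Pricey Book"));
        System.out.println(titlesAbove(35.0, xic, xse) + ", storage reads: " + xse.reads);
    }
}
```

The storage engine is touched once per matching book, and books that fail the price test are never read.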
I hope this diagram comes through correctly on the mailing list:
+-----------+    +--------+ DOM  +----------------+   xml   +---------+
| XPath or  |--->| XML    |----->| DOM            |-------->| xml     |
| XQL       |    | parser | APIs | implementation | storage | storage |
| processor |    +--------+      |                |<--------| engine  |
+-----------+                    +----------------+ return  | (XSE)   |
      |                                  |          value   +---------+
      |                                  | updates, using information
      |                                  | given by XSE
      |                                  V
      |                      +------------------------+
      +--------------------->| xml indexing component |
        use indexing on      | (XIC)                  |
        advanced search      +------------------------+
In this design, the XML storage engine and the XML indexing component
can be completely different products. If standard APIs are defined and
the DOM implementation uses them to store the XML document and to update
any index that has been set up, then the XPath or XQL processor will ask
the "indexing" component first for the address where the DOM node is
stored, and then go to the storage engine to get the information. I
think the diagram is self-explanatory. If anyone thinks it can be
improved, feel free to scratch some more lines on the diagram :-)
>
>On Mon, 27 Mar 2000, gopi wrote:
>
>> Hi all,
>> If this is not the right mailing list to discuss this, please suggest
>> the right one.
>> Right now there is no standard set of APIs defined by any organization
>> on "how XML should be stored for optimal processing". The XML support
>> we see in MS-SQL or Oracle 8i is not useful in the long run if you
>> want to do advanced operations on XML data. They just return the
>> result set in XML form, and when you make changes to that document
>> they are not reflected in the actual database (since for them it is
>> just one way of representing a result set). If a query (using XPath,
>> XQL or XMLQL) is made on XML data already stored in the db, either
>> they won't support it at all or they may (in future) run a relational
>> query to retrieve the result. Once XML data is stored in a relational
>> db, it loses its context information (hierarchy information). Even if
>> they store it in some form (in future), it will be inefficient when
>> an XQL query is made. The main problem is with "the way XML data is
>> stored": it is not stored in native XML form (hierarchical form).
>> There are some projects under way to store "native" XML documents,
>> such as www.dbxml.org. But if no standard APIs are defined for
>> storing and retrieving an XML document (DOM tree) in a storage
>> engine, everyone will end up with their own way of storing XML
>> documents. If somebody wants to switch from one database vendor (or
>> product providing native XML storage) to another, it will not be
>> easy. It will also be difficult to use these products as
>> "components" with other products, since there will not be any
>> standard APIs for interacting with a native XML storage product. Why
>> can't there be one standard set of APIs that every database vendor
>> (who would like to support "native" XML storage) satisfies? If this
>> happens, there will be efficient XML storage engines on the market
>> that can be replaced with one another if the user wants.
>> Probable advantages:
>> 1. It will be easy to integrate the parser with an XML storage
>>    engine. The parser implementation can use these standard APIs to
>>    store or retrieve information from storage. Any updates or queries
>>    can go through parser-provided APIs as well.
>>    The parser can have some API like
>>      parser.setStorageEngine(StorageEngine storageEngine);
>>    and the interface StorageEngine is implemented by vendors.
>> 2. The XSLT processor can resolve "XPath" expressions efficiently.
>>    There can be an "xml-caching" module and an "xml-indexing" module
>>    interacting with "xml-storage" to do optimal data access.
>> 3. The XQL processor can use the "xml-caching" and "xml-indexing"
>>    modules to retrieve or update XML data.
>> Requirements:
>> 1. These APIs should be generic, so that the XML data can be stored
>>    either completely in memory or persistently. The current DOM
>>    implementation then becomes just one case of this (i.e.,
>>    completely in memory).
>How do you see the mechanism for this? In the case of persistence,
>another method may be added to DOM: flush(), to free the memory buffer.
>This is done in some XML storage engines.
>
This need not be a DOM API; it can be a storage engine API, defined in
the "xml storage engine standard APIs". For an in-memory DOM
implementation it won't do anything, but in other cases it may commit
all the changes to the storage engine and free any "cached" information.
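As a rough illustration of how flush() could live on the storage engine rather than on DOM, here is a minimal Java sketch. Only the names StorageEngine, setStorageEngine and flush come from this thread; the store() method, both engine classes and ToyParser are assumptions made up for the example, and the "persistent" engine simulates disk with a second list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical storage-engine contract; store() is an assumed method for illustration. */
interface StorageEngine {
    void store(String nodeData);

    /** Commit pending changes and release any cached data. */
    void flush();
}

/** In-memory case: there is nothing to commit, so flush() does nothing. */
class InMemoryEngine implements StorageEngine {
    final List<String> nodes = new ArrayList<>();

    public void store(String nodeData) {
        nodes.add(nodeData);
    }

    public void flush() {
        // intentionally empty
    }
}

/** Persistent case, simulated: writes stay in a buffer until flush() commits them. */
class BufferedEngine implements StorageEngine {
    final List<String> committed = new ArrayList<>();  // stands in for the disk
    final List<String> pending = new ArrayList<>();    // stands in for the memory buffer

    public void store(String nodeData) {
        pending.add(nodeData);
    }

    public void flush() {
        committed.addAll(pending);
        pending.clear();
    }
}

/** Stand-in for a parser that offers the setStorageEngine hook proposed in this thread. */
class ToyParser {
    private StorageEngine engine;

    void setStorageEngine(StorageEngine storageEngine) {
        this.engine = storageEngine;
    }

    /** Hands each already-tokenized element to the engine, then flushes once. */
    void parse(String... elements) {
        for (String e : elements) {
            engine.store(e);
        }
        engine.flush();
    }
}
```

Code written against StorageEngine does not need to know which case it is talking to, which is the sense in which the in-memory DOM becomes just one implementation among several.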
>
>> 2. These APIs should be well defined, so that any future APIs for
>>    "xml-caching" and "xml-indexing" can be defined with respect to
>>    them. I hope this doesn't look stupid and makes sense to at least
>>    one more person :-).
>> regards, Gopinath
>>
>
>Best Regards,
>
>Nikita Vinokurov /Project Manager/ mailto:n.vinokurov@mtu-net.ru
>Yagel Open Source
>http://yagel.newmail.ru
**************************************************************************
This is xml-dev, the mailing list for XML developers.
To unsubscribe,
mailto:majordomo@xml.org&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/
**************************************************************************
--
Dongwook Shin
Visiting Scholar
Lister Hill National Center for Biomedical Communications
National Library of Medicine,
8600 Rockville Pike Bethesda 20894, MD
E-mail: dwshin@nlm.nih.gov
Tel: (301) 435-3257
FAX: (301) 480-3035
URL: http://dlb2.nlm.nih.gov/~dwshin