xml-dev - Re: need for defining standard APIs for xml storage

From: Dongwook Shin <dwshin@nlm.nih.gov>
To: xml-dev@xml.org
Date: Fri, 31 Mar 2000 08:56:46 -0500
There seem to be some errors in the moderation of this mailing list.
I posted a couple of messages recently, but they never appeared
on the list, so I am posting one of them again.

Dongwook


--------------------------------------------------------
Hi, gopi:

I am very happy to see that you raise the need for an API for XML
storage. I disagree with the argument that you can do everything with
the DOM interface. That is like saying you can run any application
entirely in memory.

I have been developing XML indexing and retrieval engines that can
scale up to large XML collections. I have also seen that similar
engines that only build a DOM and search within it fail to scale
to large collections.

Every time I develop an XML IR system, I need an API for XML storage.
The only time I want to invoke the DOM is during indexing, which is
usually performed off-line. In retrieval, I do not want to rely on the
DOM, since it may consume a huge amount of memory, which seems to be a
crucial factor in degrading retrieval performance. Instead, I want to
use a light-weight index that maps elements to the actual data. To
create such an index without depending on a specific repository, it
seems important to have a well-defined API for XML storage.
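
To make the idea concrete, here is a minimal Java sketch of such a
light-weight index. It is only an illustration under stated assumptions:
the names ElementIndex and StorageRef, and the (offset, length) form of a
storage reference, are hypothetical and not taken from any existing
product or standard.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical handle to where an element's data lives in the repository.
// The (offset, length) layout is an assumption made for illustration only.
final class StorageRef {
    final long offset;
    final int length;

    StorageRef(long offset, int length) {
        this.offset = offset;
        this.length = length;
    }
}

// Light-weight index: element path -> locations of the matching elements.
// It is filled once during (off-line) indexing and consulted during
// retrieval, so no DOM tree has to be built at query time.
final class ElementIndex {
    private final Map<String, List<StorageRef>> refsByPath = new HashMap<>();

    // Record that one element with the given path is stored at ref.
    void add(String elementPath, StorageRef ref) {
        refsByPath.computeIfAbsent(elementPath, k -> new ArrayList<>()).add(ref);
    }

    // Return every known location for the path, or an empty list if none.
    List<StorageRef> lookup(String elementPath) {
        List<StorageRef> refs = refsByPath.get(elementPath);
        if (refs == null) {
            return Collections.<StorageRef>emptyList();
        }
        return Collections.unmodifiableList(refs);
    }
}
```

The index itself holds only paths and small references, so its memory
cost grows with the number of indexed elements rather than with the size
of the documents. How a StorageRef is resolved to real data is exactly
the part a standard storage API would have to define.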

Dongwook

gopi wrote:

    >Hi Gopinath!
    >
    >Now almost all XML storage products, such as XDBM or Ozone-db, have a
    >DOM interface to communicate with the storage, or at least try to
    >build their API close to the DOM one. I know there is nothing in the
    >DOM spec about retrieving and storing Documents (only Nodes). But in
    >most cases I have encountered, one can use the XML database to model
    >filesystem folders and store Documents as subtrees of folder nodes.
    >You lose the features you get from a per-Document DTD, but it
    >provides something similar to namespaces, i.e. you may consider the
    >set of documents as one big nested document.
    >
    >As for optimal storage, existing XML engines store the document in
    >accordance with the DOM-inspired model, i.e. linked Nodes. I think
    >there is nothing to standardize.
    >
    >High-level applications such as XPath or XQL are built on top of the
    >DOM (IMHO), so they work fine if you provide them with any
    >DOM-compliant interface.

            I think the problem is here. If an XQL query is something like
    get all the nodes where "root/Book/@price > 35", do you want the XPath
    or XQL processor to search the entire XML document using DOM APIs?
    Certainly not. You will definitely expect some kind of "indexing" and
    "caching" in order to improve performance. To build an index, the
    indexing part needs some information from the storage engine (such as
    the physical address, in the case of persistent storage). If the path
    "root/Book/@price" is indexed, then while storing the XML document the
    storage engine would return the physical address of each "root/Book"
    node and its @price value. This helps in "minimizing tree traversal
    using DOM API calls". So we expect some standard way of getting this
    information (the physical address) when storing a DOM node. When the
    XML document is updated using some XML query language, which in turn
    calls DOM APIs, the "implementation" of the DOM APIs should update the
    XML storage engine and pass the returned value on to the "indexing"
    part.
            If the query is something like get all "root/Book/title" where
    "root/Book/@price > 35", then the XQL processor can first query the
    XIC (as in the figure) to get the DOM node for "root/Book" and then
    use the DOM API Node.getFirstChild(), which returns the <title>
    element. Wouldn't this be efficient? There will be minimal disk
    accesses (that is, minimal storage engine API calls to retrieve the
    DOM node).
            I hope this diagram comes through correctly on the mailing list:

    -----------------            ------------------               ----------------------
    | XPath or      |  DOM APIs  | XML parser     |  xml storage  | xml storage engine |
    | XQL processor |----------->| DOM            |-------------->|       (XSE)        |
    |               |            | implementation |<--------------|                    |
    -----------------            ------------------  return value ----------------------
            |                            |           from xml
            |                            |           storage engine
            |                            | updates, using information
            |                            | given by "XSE"
            |                            V
            |  use indexing on   --------------------------
            |  advanced search   | xml indexing component |
            |------------------->|         (XIC)          |
                                 --------------------------

            Here the XML storage engine and the XML indexing component
    can be completely different products. If some standard APIs are
    defined, and the DOM implementation uses them to store the XML
    document and update any index (if one is set), then the XPath or XQL
    processor will first ask the "indexing" component for the address
    where the DOM node is stored and then go to the storage engine to get
    the information. I think the diagram is self-explanatory. If someone
    thinks it can be improved, feel free to scratch some more lines on
    the diagram :-)
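
The query flow described above can be sketched in Java as follows. This
is only an illustration under assumptions: XmlStorageEngine, PriceIndex,
InMemoryEngine and IndexedQueryDemo are hypothetical names, the "physical
address" is modelled as a plain long, and stored nodes are plain strings
rather than DOM nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical storage engine (XSE): hands out an address when a node is stored.
interface XmlStorageEngine {
    long store(String nodeData);   // returns the "physical address"
    String fetch(long address);    // retrieves the node stored at that address
}

// Hypothetical indexing component (XIC) for one numeric path,
// for example root/Book/@price.
final class PriceIndex {
    private final TreeMap<Double, List<Long>> addressesByValue = new TreeMap<>();

    // Called by the DOM implementation with the address returned by the XSE.
    void update(double value, long address) {
        addressesByValue.computeIfAbsent(value, k -> new ArrayList<>()).add(address);
    }

    // Addresses of all nodes whose indexed value is strictly greater than bound.
    List<Long> greaterThan(double bound) {
        List<Long> result = new ArrayList<>();
        for (List<Long> addresses : addressesByValue.tailMap(bound, false).values()) {
            result.addAll(addresses);
        }
        return result;
    }
}

// Trivial in-memory engine so the sketch runs; a real one would be persistent.
final class InMemoryEngine implements XmlStorageEngine {
    private final Map<Long, String> nodes = new HashMap<>();
    private long nextAddress = 0;

    public long store(String nodeData) {
        nodes.put(nextAddress, nodeData);
        return nextAddress++;
    }

    public String fetch(long address) {
        return nodes.get(address);
    }
}

final class IndexedQueryDemo {
    // Nodes with an indexed value above bound: ask the index first, then
    // fetch only the matching nodes from the storage engine.
    static List<String> nodesAbove(PriceIndex xic, XmlStorageEngine xse, double bound) {
        List<String> nodes = new ArrayList<>();
        for (long address : xic.greaterThan(bound)) {
            nodes.add(xse.fetch(address));
        }
        return nodes;
    }
}
```

The storage engine is touched once per matching node and never for the
nodes that fail the predicate, which is the saving argued for above. The
sorted map is one possible index structure, chosen here only because it
makes range predicates such as "> 35" easy to express.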
    >
    >On Mon, 27 Mar 2000, gopi wrote:
    >
    >> Hi all,
    >>      If this is not the right mailing list to discuss this, please
    >> suggest the right one.
    >>      Right now there is no standard set of APIs defined by any
    >> organization for "how XML should be stored for optimal processing".
    >> The XML support we see in MS-SQL or Oracle 8i is not useful in the
    >> long run if you want to do advanced operations on XML data. They
    >> just return the result set data in XML form, and when you make any
    >> changes to the document, they are not reflected in the actual
    >> database (since XML is just one way of representing a result set
    >> for them). If a query is made (using XPath, XQL or XMLQL) on XML
    >> data already stored in the db, either they won't support it at all
    >> or they may (in future) translate it into a relational query to
    >> retrieve the result. Once XML data is stored in a relational db, it
    >> loses its context information (hierarchy information). Even if they
    >> store it in some form (in future), it will be inefficient when an
    >> XQL query is made. The main problem is "the way XML data is
    >> stored": it is not stored in native XML form (hierarchical form).
    >>      There are some projects under way to store "native" XML
    >> documents, such as www.dbxml.org. But if no standard APIs are
    >> defined for storing and retrieving an XML document (DOM tree) in a
    >> storage engine, everyone will end up with their own way of storing
    >> XML documents. If somebody wants to switch from one database vendor
    >> (or product providing native XML storage) to another, it will not
    >> be easy. It will also be difficult to use these products as
    >> "components" with other products. There will not be any standard
    >> APIs to interact with a native XML storage product. Why can't there
    >> be one standard set of APIs that every database vendor who would
    >> like to support "native" XML storage satisfies? If this happens,
    >> there will be efficient XML storage engines on the market, each of
    >> which can be replaced with another if the user wants.
    >>      Probable advantages:
    >>              1. It will be easy to integrate the parser with an XML
    >> storage engine. The parser implementation can use these standard
    >> APIs to store or retrieve information from storage. Any updates or
    >> queries can also go through the parser-provided APIs.
    >>              The parser can have some API like
    >>                      parser.setStorageEngine(StorageEngine storageEngine);
    >>              and the interface StorageEngine is implemented by
    >> vendors.
    >>              2. An XSLT processor can resolve "XPath" expressions
    >> efficiently. There can be an "xml-caching" module and an
    >> "xml-indexing" module interacting with "xml-storage" for optimal
    >> data access.
    >>              3. An XQL processor can use the "xml-caching" and
    >> "xml-indexing" modules to retrieve or update XML data.
    >>      Requirements:
    >>      1. These APIs should be generic, so that the XML data can be
    >> stored either completely in memory or persistently. The current DOM
    >> implementation then becomes just one case of this (i.e., completely
    >> in memory).

    >How do you see the mechanism for this? In the case of persistence,
    >another method could be added to the DOM: flush(), to free the memory
    >buffer. This is done in some XML storage engines.
    >
            This need not be a DOM API; it can be a storage engine API,
    defined in the "xml storage engine standard APIs". For an in-memory
    DOM implementation it won't do anything, but in other cases it may
    commit all the changes to the storage engine and free any "cached"
    information.
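
A rough Java sketch of that behaviour, assuming a StorageEngine interface
like the one named earlier in the thread. Only the interface name and
flush() come from the discussion; the write() method and both implementing
classes are hypothetical, and a list stands in for the disk.

```java
import java.util.ArrayList;
import java.util.List;

// Interface name taken from the thread; the methods are assumptions.
interface StorageEngine {
    void write(String nodeData);
    void flush();   // commit pending changes and free cached information
}

// Purely in-memory case: nothing to commit, so flush() does nothing.
final class InMemoryStorageEngine implements StorageEngine {
    final List<String> nodes = new ArrayList<>();

    public void write(String nodeData) {
        nodes.add(nodeData);
    }

    public void flush() {
        // Intentionally empty: the data already lives where it will stay.
    }
}

// Stand-in for a persistent engine: writes are buffered until flush().
// The "committed" list plays the role of the disk in this sketch.
final class BufferedStorageEngine implements StorageEngine {
    final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    public void write(String nodeData) {
        pending.add(nodeData);
    }

    public void flush() {
        committed.addAll(pending);  // commit all buffered changes
        pending.clear();            // release the cache
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Because both classes implement the same interface, the caller can invoke
flush() without knowing which kind of engine is behind it.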
    >
    >>      2. These APIs should be well defined so that any future APIs
    >> for "xml-caching" and "xml-indexing" can be defined with respect to
    >> them. I hope this doesn't look stupid and makes sense to at least
    >> one more person :-).
    >> regards, Gopinath
    >>
    >
    >Best Regards,
    >
    >Nikita Vinokurov /Project Manager/ mailto:n.vinokurov@mtu-net.ru

    >Yagel Open Source
    >http://yagel.newmail.ru



**************************************************************************

    This is xml-dev, the mailing list for XML developers.
    To unsubscribe,
    mailto:majordomo@xml.org&BODY=unsubscribe%20xml-dev
    List archives are available at http://xml.org/archives/xml-dev/


**************************************************************************

--
Dongwook Shin
Visiting Scholar
Lister Hill National Center for Biomedical Communications
National Library of Medicine,
8600 Rockville Pike Bethesda 20894, MD
E-mail: dwshin@nlm.nih.gov
Tel: (301) 435-3257
FAX: (301) 480-3035
URL: http://dlb2.nlm.nih.gov/~dwshin


