   RE: need for defining standard APIs for xml storage

  • From: "Didier PH Martin" <martind@netfolder.com>
  • To: "Dongwook Shin" <dwshin@nlm.nih.gov>, <xml-dev@xml.org>
  • Date: Mon, 3 Apr 2000 12:15:44 -0400

Hi Dongwook,

Dongwook said:
Very close, but not exactly. Most of the literature categorizes queries
into two or three kinds:
    a) Structural queries
    b) Content queries
    c) Attribute queries (This can be considered as a subset of a))

Didier replies:
By element queries I meant something closer to structural queries, and I
have to admit that the term "structural query" is better chosen than the
term "element query". By content query, however, I meant queries on content
that is not structured as XML, for instance an HTML document, a PDF or a
Word document, or content that is structured as XML but not meaningfully
enough to facilitate information retrieval.

Of course I can create an element and include the whole document as its data
content, as you suggested, but then the structure does not help me retrieve
the content. Classical text indexing tools are more useful in that case. So
let's reduce the query family to two kinds:
a) structured
b) unstructured

(a) is available when the document has an XML structure, but not always.
Do not forget the case of an XHTML document whose content is packaged as
<p> elements. Having a big bunch of <p> elements does not help me retrieve
the content. In that case we may consider the document unstructured and
access its information components with unstructured document query
techniques (i.e. indexing). In some cases, where the data content is big
enough, I can further index that content even if it is enclosed by a
meaningful element. So, to be more precise, we can have queries based on
the structural components of XML (elements, attributes, etc.) and queries
based on classical indexing techniques. The main difference from classical
indexing is that the unit of information retrieval becomes an element
instead of a document: the index points to an element, not a document. This
is especially useful in the case of a permanent information set (i.e. a
GROVE).
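
As an illustration only, here is a minimal Python sketch of an index whose
postings are elements rather than documents. The sample markup, the function
name build_element_index and the use of Python's xml.etree.ElementTree are
assumptions made for this example, not part of any proposed API.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data, invented for this sketch.
SAMPLE = (
    "<PLAY>"
    "<SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>To be, or not to be</LINE></SPEECH>"
    "<SPEECH><SPEAKER>OPHELIA</SPEAKER><LINE>Good my lord</LINE></SPEECH>"
    "</PLAY>"
)


def build_element_index(root):
    """Map each lowercased word to the elements whose own text contains it."""
    index = defaultdict(list)
    for element in root.iter():
        # set() so an element is posted once per word, even if it repeats.
        for word in set(re.findall(r"\w+", (element.text or "").lower())):
            index[word].append(element)
    return index


root = ET.fromstring(SAMPLE)
index = build_element_index(root)

# ElementTree keeps no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in root.iter() for child in parent}

# Look up the word first, then filter on structure: no full tree scan needed.
speakers = [e for e in index["hamlet"] if e.tag == "SPEAKER"]
speeches = [parents[e] for e in speakers]
print([s.find("LINE").text for s in speeches])
```

The lookup goes straight from the word to the matching elements, and the
structural condition is checked only on those few candidates.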

Dongwook said:
I mean that however you extend the DOM, you get into the same situation.
Basically, the DOM is a representation of the whole XML document.
On the other hand, the index is a small set of pointers to the actual data.
If you have a query like "find a SPEECH whose SPEAKER contains 'hamlet'",
you have to search the whole DOM, which does not scale to large documents.
On the other hand, if you have an inverted index for the document,
you can get the elements containing "hamlet" immediately.

Didier replies:
Not necessarily. Just separate the interfaces from the category name, and
set aside the name "Document Object Model", which is a very poor frame for
any evolution of these interfaces.

What we have in our hands are objects with particular interfaces. All
objects inherit from the same base interface, so we can say that they all
inherit from the same base class: the node. Further value is then added by
extending the basic interface with new interfaces. This is the inheritance
mechanism. Take the case of an SVG element, which adds a lot more value to
the basic node interface. Now, let's say that these objects are simply an
API used to access elements stored in a permanent information set (i.e. a
grove). In this case we deal with node objects, and therefore we can use
patterns such as the observer, or other patterns, to navigate the whole
tree, or we can provide a member that gives access to any element. This is
what's behind the SelectNodes function. Whatever object you obtained from
the permanent information set, you can obtain a new one with this function,
without having to permanently keep a root object. You do not even need the
notion of a document, just the notion of an entry point. This also removes
the dependency on the document object, which does not make sense for a
permanent information set composed of several XML text documents. However,
if the comfort level of this abstraction is acceptable, we can perceive the
whole library as a single document. I agree, though, that we have to be
very careful with our metaphors.
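
To make the entry-point idea concrete, here is a rough Python sketch. The
Node class, its select_nodes method and build_library are hypothetical names
invented for this example; select_nodes only stands in for a SelectNodes-style
member and relies on the limited XPath subset of xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical node wrapper: any node can serve as an entry point."""

    def __init__(self, element, parents):
        self._element = element
        self._parents = parents  # shared child -> parent map for the library

    @property
    def name(self):
        return self._element.tag

    @property
    def text(self):
        return self._element.text

    def _top(self):
        # Walk up to the top of the information set from wherever we are.
        element = self._element
        while element in self._parents:
            element = self._parents[element]
        return element

    def select_nodes(self, path):
        """Evaluate `path` against the whole library, starting from any node."""
        return [Node(e, self._parents) for e in self._top().findall(path)]


def build_library(*documents):
    """Compose several XML text documents into one information set."""
    library = ET.Element("library")
    for text in documents:
        library.append(ET.fromstring(text))
    parents = {child: parent for parent in library.iter() for child in parent}
    return Node(library, parents)


entry = build_library(
    "<play><title>Hamlet</title></play>",
    "<play><title>Macbeth</title></play>",
)
first_title = entry.select_nodes(".//title")[0]

# Query again from a leaf node; the caller keeps no root or document object.
all_titles = first_title.select_nodes(".//title")
print([n.text for n in all_titles])
```

The caller never holds a document object here: the two source documents are
simply parts of one library, reachable from whichever node is at hand.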

So, to make things clearer, I should probably talk about a new model that
reuses the same interfaces that the DOM offers. The context here is no
longer that of a single document but that of the library :-))

Dongwook said:
It seems to me that the notion of a "permanent information set" looks like
a data repository. The first issue here is how you store the data and refer
to it elsewhere. Another is what the query space (the document space a
query should look at) should be: should it be limited to the current XML
fragment, or extended to follow links? Your GROVE seems to be one solution
for that.

Didier replies:
Yes it is. If we pay attention to what the DOM is in the context of a
browser, we can say that it is a transient data repository. The browser
obtains a serialized version of the information set, and this serialized
format takes the form of an XML document. The browser rebuilds the
information set from the serialized version. Then, locally, several agents
such as scripts or a style sheet engine can use this information set.

Now, imagine that on one end I have not a human-driven agent like a browser
but a computer or an automatic agent. The latter sends a query to obtain an
information set. The information set could be a tiny fragment of a bigger
one stored on the provider side. The provider replies by sending a
serialized version of the information set: an XML document. The receiver
deserializes it (i.e. parses it) and either a) builds a transient
information set or b) inserts the received information set into another,
permanent one. Thus what these computers transmit is an information set
representing a fragment of their respective information sets.
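
The exchange can be sketched in a few lines of Python. The catalog markup
and the function names answer_query and receive are assumptions made for this
example; a real provider and receiver would sit on opposite ends of a network
connection.

```python
import xml.etree.ElementTree as ET

# Provider side: a larger information set (hypothetical sample data).
provider_set = ET.fromstring(
    "<catalog>"
    "<book id='b1'><title>First</title></book>"
    "<book id='b2'><title>Second</title></book>"
    "</catalog>"
)


def answer_query(information_set, path):
    """Serialize the first fragment matching `path` as an XML document."""
    fragment = information_set.find(path)
    if fragment is None:
        raise LookupError(f"no fragment matches {path!r}")
    return ET.tostring(fragment, encoding="unicode")


def receive(document, permanent_set=None):
    """Parse the reply; optionally insert it into a permanent set."""
    fragment = ET.fromstring(document)    # a) transient information set
    if permanent_set is not None:
        permanent_set.append(fragment)    # b) part of a permanent one
    return fragment


# The wire format is just an XML document carrying the fragment.
wire = answer_query(provider_set, ".//book[@id='b2']")

receiver_set = ET.Element("library")      # receiver's own information set
receive(wire, receiver_set)
print(receiver_set.find("book/title").text)
```

Only the requested fragment crosses the wire, and the provider's own
information set is left untouched.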

So, when we speak of permanent information sets, we are no longer in the
same space as the browsers, and the universe is not restricted to the
single transmitted document. In fact, the producer extracts a fragment of
its own information set and sends this fragment to the receiver. The latter
can include this information set fragment in its own. We have just created
an abstract model representing the information space between these two
agents.

So, what if this information set fragment is stored in a directory service?
It becomes a directory service node. Then, inside that node, I have other
nodes that represent the received information set. But you can always
consider the directory service as a huge document, obtain from it a
serialized version in the form of an XML document, and later apply a style
sheet to it to make it fit for consumption by our senses.

Didier PH Martin
Email: martind@netfolder.com
Conferences: Web Chicago (http://www.mfweb.com)
             XML Europe (http://www.gca.org)
Book: XML Professional (http://www.wrox.com)
Column: Style Matters (http://www.xml.com)
Products: http://www.netfolder.com
