   Re: need for defining standard APIs for xml storage

  • From: Dongwook Shin <dwshin@nlm.nih.gov>
  • To: xml-dev@xml.org
  • Date: Mon, 03 Apr 2000 15:00:03 -0400

Hi, Martin:

Martin said:

> It seems also that we need two kind of queries
> a) queries based on the elements.
> b) queries based on the data content.
> The last query is needed when not all information is tagged. In this
> case we end up with a situation where the knowledge is stored in the data
> but not tagged and therefore need to be indexed to be easily

Very close, but not exactly. Most of the literature categorizes queries
into two or three kinds:
    a) Structural queries
    b) Content queries
    c) Attribute queries (these can be considered a subset of a))
But content queries are still in the context of elements. For instance,
a query such as "find a SPEECH whose SPEAKER contains 'hamlet'" is
regarded as a content query even though it states a certain element
relationship. On the other hand, structural queries are the ones that
address only the relationships among elements, like "Find SPEECH having
at least THREE SPEAKER elements".
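To make the distinction concrete, here is a minimal sketch in Python
using the standard xml.etree.ElementTree module. The sample document and
the function names content_query and structural_query are my own
assumptions for illustration; "contains" is read as a case-insensitive
substring match, which is only one possible interpretation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, loosely modeled on the SPEECH/SPEAKER example.
PLAY = """
<SCENE>
  <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>To be, or not to be</LINE></SPEECH>
  <SPEECH>
    <SPEAKER>ROSENCRANTZ</SPEAKER>
    <SPEAKER>GUILDENSTERN</SPEAKER>
    <SPEAKER>HORATIO</SPEAKER>
    <LINE>My lord!</LINE>
  </SPEECH>
  <SPEECH><SPEAKER>OPHELIA</SPEAKER><LINE>Good my lord</LINE></SPEECH>
</SCENE>
"""

root = ET.fromstring(PLAY)


def content_query(root, word):
    """Content query: SPEECH elements whose SPEAKER text contains `word`."""
    word = word.lower()
    return [
        speech
        for speech in root.iter("SPEECH")
        if any(word in (sp.text or "").lower() for sp in speech.findall("SPEAKER"))
    ]


def structural_query(root, minimum):
    """Structural query: SPEECH elements with at least `minimum` SPEAKER children."""
    return [
        speech
        for speech in root.iter("SPEECH")
        if len(speech.findall("SPEAKER")) >= minimum
    ]


hamlet_speeches = content_query(root, "hamlet")
crowded_speeches = structural_query(root, 3)
```

The content query still names elements (SPEECH, SPEAKER) but its
condition is on the text; the structural query never looks at the text.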

Whatever you define them to be, I think it is theoretically clearer to
assume that all the data content is tagged, even when it actually is
not. For instance, if you happen to get plain text, you can assume that
it is enclosed by the tags <DOC> </DOC>, or whatever you want. By doing
so, you can bring legacy plain text into the XML framework with minimal
overhead.
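As a small sketch of that wrapping step, again in Python with
xml.etree.ElementTree: the helper name wrap_plain_text and the sample
string are assumptions for illustration, and DOC is just the tag name
used above.

```python
import xml.etree.ElementTree as ET


def wrap_plain_text(text, tag="DOC"):
    """Treat untagged text as the content of a single root element."""
    elem = ET.Element(tag)
    elem.text = text
    return elem


legacy_text = "Hamlet said: 2 < 3 & so on"
doc = wrap_plain_text(legacy_text)

# Serializing escapes markup characters, so the result is well-formed XML.
xml_text = ET.tostring(doc, encoding="unicode")
# xml_text == "<DOC>Hamlet said: 2 &lt; 3 &amp; so on</DOC>"
```

Once wrapped, the legacy text can be queried like any other element
content.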

Martin said:

> So, if the DOM would include a function such as: node-set =
> (queryType, Expression) then we can have any kind of queries applied on an
> information set without having to add a new function each time we add a new
> query type.

> Now about your indexes, what kind of algorithm are you using  for the
> elements?

I mean that whatever you add to the DOM, you get into the same situation.
Basically, the DOM is a representation of the whole XML document.
The index, on the other hand, is a small set of pointers to the actual
data. If you have a query like "find a SPEECH whose SPEAKER contains
'hamlet'", you have to search the whole DOM, which does not scale to
large documents. If instead you have an inverted index for the document,
you can get the elements containing "hamlet" immediately.
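A rough sketch of the idea, in Python with the standard library only.
This is not the indexing scheme of any particular system: the key layout
(tag, word), the helper names build_index and speeches_with_speaker, and
the sample document are all assumptions, and "contains" is simplified to
a whole-word match on the element's own text.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data.
PLAY = """
<SCENE>
  <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>To be, or not to be</LINE></SPEECH>
  <SPEECH><SPEAKER>OPHELIA</SPEAKER><LINE>Good my lord Hamlet</LINE></SPEECH>
</SCENE>
"""


def build_index(root):
    """Build an inverted index from (tag, word) to the elements holding that word.

    Only each element's direct text is indexed. A child-to-parent map is
    built as well, because ElementTree elements do not store their parent.
    """
    index = defaultdict(set)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in root.iter():
        for word in re.findall(r"\w+", (elem.text or "").lower()):
            index[(elem.tag, word)].add(elem)
    return index, parent_of


def speeches_with_speaker(index, parent_of, word):
    """Answer the SPEECH/SPEAKER query from the index, without walking the tree."""
    hits = index.get(("SPEAKER", word.lower()), set())
    return [parent_of[sp] for sp in hits if parent_of[sp].tag == "SPEECH"]


root = ET.fromstring(PLAY)
index, parent_of = build_index(root)   # one full pass, done once
result = speeches_with_speaker(index, parent_of, "hamlet")
```

The tree is scanned once when the index is built; after that, each query
is a dictionary lookup plus a step up to the parent element.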

Martin said:

> now, you can also imagine that one of the nodes represent a topic and
> this topic is a keyword located in different documents.
> You can as well choose to have an xlink element to point to an XML
> instead of an xinclude:include element.
> The point here to note is that a permanent information set can have
> content in diverse forms, but still show an XML face. So, structured
> unstructured content can be freely intermixed. B+* trees can be used
> retrieve structured content and text indexing indexes used to retrieve
> unstructured content. So, as soon as we talk about XML storage it is
> to have an engine that can wraps this back end diversity. To hide the
> that the XML hierarchy does not necessarily comes from a text
> Finally that this diversity is resolved by a small set of data types
> nodes, node-sets, etc...
> In fact, all this stuff is what the GROVE is about. Now the DOM should
> evolve to be not only an interface to parsed text documents but also
> interface to information sets. An information set is not necessarily
> from a text document. In fact, information sets could be the latest
> incarnation of hierarchical databases. Or the latest incarnation of an
> aggregation tool. We are slowly evolving toward that goal.

It seems to me that the notion of a "permanent information set" looks
like a data repository. The first issue here is how you store the data
and refer to it elsewhere. Another is what the query space (the
space a query should look at) should be: should it be limited to the
XML fragment, or extended by following links? Your GROVE seems to
be one solution for that.


Dongwook Shin
Visiting Scholar
Lister Hill National Center for Biomedical Communications
National Library of Medicine,
8600 Rockville Pike Bethesda 20894, MD
E-mail: dwshin@nlm.nih.gov
Tel: (301) 435-3257
FAX: (301) 480-3035
URL: http://dlb2.nlm.nih.gov/~dwshin


