- From: Dongwook Shin <dwshin@nlm.nih.gov>
- To: martind@netfolder.com, xml-dev@xml.org
- Date: Mon, 03 Apr 2000 10:33:24 -0400
Hi, Martin:
Martin said:
> It seems also that we need two kind of queries
> a) queries based on the elements.
> b) queries based on the data content.
>
> The last query is needed when not all information is tagged. In this case we
> end up with a situation where the knowledge is stored in the data content
> but not tagged and therefore need to be indexed to be easily retrieved.
Very close, but not exactly. Most of the literature categorizes queries
into two or three kinds:
a) Structural queries
b) Content queries
c) Attribute queries (these can be considered a subset of a))
However, content queries are still evaluated in the context of elements.
For instance, the query "find a SPEECH whose SPEAKER contains 'hamlet'" is
regarded as a content query even though it states a certain element
relationship. Structural queries, on the other hand, are those that address
only the relationships among elements, like "find a SPEECH having at least
three SPEAKER elements".
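To make the distinction concrete, here is a minimal sketch in Python using the
standard xml.etree.ElementTree module. The sample document and the function
names content_query and structural_query are my own assumptions for
illustration; they do not come from any particular query engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, loosely in the style of play markup with SPEECH/SPEAKER.
SAMPLE = """<SCENE>
  <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>A line spoken alone</LINE></SPEECH>
  <SPEECH>
    <SPEAKER>ROSENCRANTZ</SPEAKER>
    <SPEAKER>GUILDENSTERN</SPEAKER>
    <SPEAKER>HAMLET</SPEAKER>
    <LINE>A line spoken together</LINE>
  </SPEECH>
  <SPEECH><SPEAKER>OPHELIA</SPEAKER><LINE>Another line</LINE></SPEECH>
</SCENE>"""


def content_query(root, word):
    """Content query: SPEECH elements whose SPEAKER text contains `word`."""
    word = word.lower()
    return [
        speech
        for speech in root.iter("SPEECH")
        if any(word in (sp.text or "").lower() for sp in speech.findall("SPEAKER"))
    ]


def structural_query(root, min_speakers=3):
    """Structural query: SPEECH elements with at least `min_speakers` SPEAKER children."""
    return [
        speech
        for speech in root.iter("SPEECH")
        if len(speech.findall("SPEAKER")) >= min_speakers
    ]


root = ET.fromstring(SAMPLE)
by_content = content_query(root, "hamlet")
by_structure = structural_query(root)
```

Note that the content query still names elements (SPEECH, SPEAKER); it differs
from the structural one only in that it also inspects the text.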
However you define them, I think it is theoretically cleaner to assume
that all data content is tagged, even when it actually is not. For
instance, if you happen to get plain text, you can assume that it is enclosed
by the tags <DOC> </DOC>, or whatever you want. By doing so, you can bring all
the legacy plain text into the XML framework with minimal overhead.
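As a rough sketch of that wrapping step (Python, standard library only; the
function name wrap_plain_text is hypothetical), the one practical detail is
that markup characters in the legacy text have to be escaped first:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_plain_text(text, tag="DOC"):
    """Enclose plain text in a single element, escaping &, < and >."""
    return f"<{tag}>{escape(text)}</{tag}>"


# Hypothetical legacy text containing characters that are special in XML.
legacy = "Profit & loss < 5%"
doc = ET.fromstring(wrap_plain_text(legacy))
```

After this, the text is reachable as the content of a DOC element, so the same
element-based content queries apply to it.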
Martin said:
> So, if the DOM would include a function such as: node-set = selectNodes
> (queryType, Expression) then we can have any kind of queries applied on an
> information set without having to add a new function each time we add a new
> query type.
> Now about your indexes, what kind of algorithm are you using for the
> elements?
I mean that however you extend the DOM, you end up in the same situation.
Basically, the DOM is a representation of the whole XML document.
An index, on the other hand, is a small set of pointers to the actual data.
If you have a query like "find a SPEECH whose SPEAKER contains 'hamlet'",
you have to search the whole DOM, which does not scale to large documents.
If you have an inverted index for the document, on the other hand,
you can get the elements containing "hamlet" immediately.
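Here is a small sketch of the inverted-index idea in Python. It is only an
in-memory illustration under my own assumptions (the names build_index and
speeches_by_speaker are hypothetical, and a real index would live on disk and
store positions rather than object references), but it shows why the lookup
does not need to walk the tree at query time.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: "Hamlet" appears once as a SPEAKER and once inside a LINE.
SAMPLE = """<SCENE>
  <SPEECH><SPEAKER>HAMLET</SPEAKER><LINE>A line of his own</LINE></SPEECH>
  <SPEECH><SPEAKER>OPHELIA</SPEAKER><LINE>A line about Hamlet</LINE></SPEECH>
  <SPEECH><SPEAKER>HORATIO</SPEAKER><LINE>An unrelated line</LINE></SPEECH>
</SCENE>"""


def build_index(root):
    """Map each lowercased term to the (element, parent) pairs whose text holds it.

    The tree is walked once, at indexing time. The root's own text is not
    indexed in this sketch because the root has no parent.
    """
    index = defaultdict(list)
    for parent in root.iter():
        for child in parent:
            terms = set(re.findall(r"\w+", (child.text or "").lower()))
            for term in terms:
                index[term].append((child, parent))
    return index


def speeches_by_speaker(index, word):
    """Answer "find a SPEECH whose SPEAKER contains word" from the index alone."""
    return [
        parent
        for elem, parent in index.get(word.lower(), [])
        if elem.tag == "SPEAKER" and parent.tag == "SPEECH"
    ]


root = ET.fromstring(SAMPLE)
index = build_index(root)
hits = speeches_by_speaker(index, "hamlet")
```

The query only touches the postings for "hamlet" (two entries here) and then
filters by element name, instead of visiting every node in the document.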
Martin said:
> now, you can also imagine that one of the nodes represent a topic and that
> this topic is a keyword located in different documents.
>
> You can as well choose to have an xlink element to point to an XML fragment
> instead of an xinclude:include element.
>
> The point here to note is that a permanent information set can have its
> content in diverse forms, but still show an XML face. So, structured and
> unstructured content can be freely intermixed. B+* trees can be used to
> retrieve structured content and text indexing indexes used to retrieve
> unstructured content. So, as soon as we talk about XML storage it is better
> to have an engine that can wraps this back end diversity. To hide the fact
> that the XML hierarchy does not necessarily comes from a text document.
> Finally that this diversity is resolved by a small set of data types like
> nodes, node-sets, etc...
>
> In fact, all this stuff is what the GROVE is about. Now the DOM should
> evolve to be not only an interface to parsed text documents but also an
> interface to information sets. An information set is not necessarily coming
> from a text document. In fact, information sets could be the latest
> incarnation of hierarchical databases. Or the latest incarnation of an
> aggregation tool. We are slowly evolving toward that goal.
>
It seems to me that the notion of a "permanent information set" resembles
a data repository. The first issue here is how you store the data
and refer to it elsewhere. Another is what the query space (the document
space a query should look at) should be: should it be limited to the current
XML fragment, or extended to follow links? Your GROVE seems to
be one solution for that.
Thanks
Dongwook
--
Dongwook Shin
Visiting Scholar
Lister Hill National Center for Biomedical Communications
National Library of Medicine,
8600 Rockville Pike Bethesda 20894, MD
E-mail: dwshin@nlm.nih.gov
Tel: (301) 435-3257
FAX: (301) 480-3035
URL: http://dlb2.nlm.nih.gov/~dwshin
***************************************************************************
This is xml-dev, the mailing list for XML developers.
To unsubscribe, mailto:majordomo@xml.org&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/
***************************************************************************