   Re: xml search engine?

  • From: "Martin Bryan" <mtbryan@sgml.u-net.com>
  • To: <xml-dev@xml.org>
  • Date: Tue, 4 Apr 2000 09:08:25 +0100

It strikes me that this debate is missing something.

The advantage of XML queries over other forms of query is that you can use context to identify the subset of information within a document that you need to search to find a meaningful result. Instead of having to look at all indexed occurrences of a term, you only need to look at the subset that is "associated" with a given context. This should, hopefully, reduce the level of information overload we all suffer from at present.

The key to efficiency is going to be the mapping between the semantics of the context-determining elements (remembering that we are talking about a chain of ancestors for most elements) and the terms used in a natural language (or near natural language) query. Unless there is a close match between query semantics and markup semantics, the results of the query will be meaningless.

The first question that needs to be asked is "how do users identify the contexts in which data is likely to be meaningful?" Take the example used in another thread: "find a SPEECH whose SPEAKER contains 'hamlet'". What happens if I coded my text as <Hamlet>To be or not to be</Hamlet>? How do I know that the tag name identifies the speaker of a speech? Yet it obviously does - that's the whole intention of the tag. OK, so it's a non-generalized DTD, bad practice. But what about <Part role="Hamlet">To be or not to be</Part>? Again, how do I relate the tag to the query?
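To make the point concrete, here is a minimal sketch in Python (standard library only) of the same "speeches by Hamlet" request against three encodings. The SPEECH/SPEAKER/LINE and PLAY wrappers are assumptions added for illustration; only the <Hamlet> and <Part role="Hamlet"> forms come from the examples above. Each encoding needs a differently written lookup, which is exactly the knowledge a user does not have without the DTD.

```python
import xml.etree.ElementTree as ET

# Three hypothetical encodings of the same line (wrapper elements are assumed).
generic = ET.fromstring(
    "<SPEECH><SPEAKER>Hamlet</SPEAKER><LINE>To be or not to be</LINE></SPEECH>"
)
tag_as_speaker = ET.fromstring("<PLAY><Hamlet>To be or not to be</Hamlet></PLAY>")
attr_as_speaker = ET.fromstring(
    '<PLAY><Part role="Hamlet">To be or not to be</Part></PLAY>'
)


def speeches_generic(root):
    # Speaker is element content: match SPEECH whose SPEAKER contains "hamlet".
    return [
        s.findtext("LINE")
        for s in root.iter("SPEECH")
        if "hamlet" in (s.findtext("SPEAKER") or "").lower()
    ]


def speeches_tag(root):
    # Speaker is the element name itself.
    return [e.text for e in root.iter("Hamlet")]


def speeches_attr(root):
    # Speaker is an attribute value on a generic Part element.
    return [e.text for e in root.findall(".//Part[@role='Hamlet']")]


results = [
    speeches_generic(generic),
    speeches_tag(tag_as_speaker),
    speeches_attr(attr_as_speaker),
]
```

All three functions return the same text, but none of them can be derived from the natural-language query alone.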

Structured queries can only be generated accurately from knowledge of the DTD that governs the content being queried. Len Bullard hit the nail on the head. The first port of call is the namespace. The second is the DTD/schema for that namespace, and the third is the contents of elements coded using a specific element within the DTD. Queries need to be based on the elements defined for a particular namespace.

So let's try to write queries based on this, something like:

Find me occurrences of the phrase "ABC DEF" within elements whose parents contain "ELEMENT-X" or "Attribute-Y" within "Namespace-Z"
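One possible reading of that request, sketched in Python with the standard library. The interpretation is an assumption: a hit is any element containing the phrase whose parent is in the given namespace and is either named ELEMENT-X or carries an Attribute-Y attribute. The namespace URIs, the sample document and the function name find_phrase are all hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:namespace-z"  # hypothetical URI standing in for "Namespace-Z"

DOC = f"""<root xmlns="{NS}" xmlns:o="urn:example:other">
  <ELEMENT-X><p>see ABC DEF here</p></ELEMENT-X>
  <section Attribute-Y="1"><p>ABC DEF again</p></section>
  <section><p>ABC DEF without context</p></section>
  <o:ELEMENT-X><o:p>ABC DEF in another namespace</o:p></o:ELEMENT-X>
</root>"""


def find_phrase(root, phrase, ns, parent_tag, parent_attr):
    """Return child elements containing `phrase` whose parent matches the context."""
    hits = []
    prefix = "{" + ns + "}"
    for parent in root.iter():
        # Namespace filter: ElementTree spells qualified names as {uri}local.
        if not parent.tag.startswith(prefix):
            continue
        local = parent.tag[len(prefix):]
        # Context filter: parent is the named element OR carries the attribute.
        if local != parent_tag and parent_attr not in parent.attrib:
            continue
        for child in parent:
            if phrase in "".join(child.itertext()):
                hits.append(child)
    return hits


hits = find_phrase(ET.fromstring(DOC), "ABC DEF", NS, "ELEMENT-X", "Attribute-Y")
hit_texts = [h.text for h in hits]
```

The third and fourth occurrences of the phrase are skipped: one has no matching context, the other sits in a different namespace.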

Indexing for such a query will need to be based on a combination of Namespace, Context and Contents. Omitting any one of these components will make it impossible to search servers efficiently. Mapping between the contexts in different namespaces is, I feel, going to be beyond what current systems will be able to provide. But we need to consider how it might be possible to do this.
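As a rough illustration of such an index, here is a small Python sketch that keys every word on the triple (namespace, ancestor chain, word). This is only one way to combine the three components; the key layout, the build_index name and the sample document are assumptions, and for brevity the sketch indexes only the text directly at the start of each element (not attribute values or text following child elements).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_tag(tag):
    # ElementTree writes qualified names as "{namespace}local".
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return ns, local
    return "", tag


def build_index(root, doc_id):
    """Map (namespace, ancestor path, word) -> set of document ids."""
    index = defaultdict(set)

    def walk(elem, ancestors):
        ns, local = split_tag(elem.tag)
        path = ancestors + (local,)  # the chain of ancestors is the context
        for word in (elem.text or "").lower().split():
            index[(ns, path, word)].add(doc_id)
        for child in elem:
            walk(child, path)

    walk(root, ())
    return index


doc = ET.fromstring(
    '<play xmlns="urn:example:plays">'
    "<speech><speaker>Hamlet</speaker><line>To be or not to be</line></speech>"
    "</play>"
)
idx = build_index(doc, "doc1")
speaker_key = ("urn:example:plays", ("play", "speech", "speaker"), "hamlet")
```

Looking up speaker_key finds doc1, while the same word under the ("play", "speech", "line") context finds nothing, so the context narrows the search as described.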

The real key is how we break the query down into something that can be exchanged between servers. Something along the lines of:

<attribute name="role">Hamlet</attribute>
<content>ABC DEF</content>

might let different servers select the data with different tools and engines.
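A receiving server could turn such a fragment into its own internal query. The Python sketch below assumes the two elements are wrapped in a <query> root (the fragment above has no single root, so the wrapper is an addition), and the parse_query and run_query names and the sample play document are hypothetical. A different server could map the same fragment onto an SQL or full-text engine instead.

```python
import xml.etree.ElementTree as ET

# The exchanged fragment, wrapped in an assumed <query> root element.
QUERY = """<query>
<attribute name="role">Hamlet</attribute>
<content>ABC DEF</content>
</query>"""


def parse_query(text):
    """Split the exchanged query into attribute constraints and a phrase."""
    q = ET.fromstring(text)
    attrs = {a.get("name"): (a.text or "") for a in q.findall("attribute")}
    phrase = q.findtext("content", default="")
    return attrs, phrase


def run_query(root, attrs, phrase):
    """One possible local engine: a linear scan over the parsed tree."""
    return [
        e
        for e in root.iter()
        if all(e.get(k) == v for k, v in attrs.items())
        and phrase in "".join(e.itertext())
    ]


play = ET.fromstring(
    "<play>"
    '<Part role="Hamlet">says ABC DEF</Part>'
    '<Part role="Ophelia">ABC DEF</Part>'
    "</play>"
)
attrs, phrase = parse_query(QUERY)
matches = run_query(play, attrs, phrase)
```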

Martin Bryan


