   RE: xml search engine?

  • From: "Didier PH Martin" <martind@netfolder.com>
  • To: "David Megginson" <david@megginson.com>, <xml-dev@XML.ORG>
  • Date: Sat, 1 Apr 2000 13:04:49 -0500

Hi David,

David said:
Try something like "what free operating systems have mp3 support?" and
you have to wade through many more hits before you find useful…

Didier replies:
This is the whole problem with recognizing semantics in a bunch of text.
XML won't resolve this problem unless we all agree on a particular
ontology. So I guess that until we have machines able to learn enough
about the different contexts, we won't have better results.

By the way, still thinking about this query stuff and about putting things
into a particular context, let's imagine a scenario.

We have a big bunch of XML documents (not based on the XHTML domain
language) posted on the web. We do a query (or a search, if I am using Tim's
terminology) for a particular topic. The query is a simple string like
"hyperinflation". The engine we submitted the query to then scans its
indexes and returns, let's say, an XML document containing an xlink:extended
element, which in turn contains a collection of locators: one locator for
each resource related to the topic. This scenario is what we are used to
today: we do a query and get links to complete documents.
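As a rough sketch of that first scenario, the reply might look like the
following (the xlink attributes follow the XLink draft; the wrapping element
name, URLs, and titles here are invented for illustration):

```xml
<!-- Hypothetical reply for the query "hyperinflation":
     one locator per whole document that matched -->
<results xmlns:xlink="http://www.w3.org/1999/xlink"
         xlink:type="extended">
  <locator xlink:type="locator"
           xlink:href="http://example.org/econ/weimar.xml"
           xlink:title="Hyperinflation in Weimar Germany"/>
  <locator xlink:type="locator"
           xlink:href="http://example.org/econ/glossary.xml"
           xlink:title="Economics glossary"/>
</results>
```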

Now let's imagine a second scenario. We submit the same request as above,
but this time the search engine returns an XML document that contains all
the elements in which the index engine found the "hyperinflation" string. In
this case, the query gives us elements, not documents; document fragments,
so to speak.
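A reply in this second scenario might instead carry the matching elements
themselves. Again a hypothetical sketch; the wrapper element names and the
content are invented:

```xml
<!-- Hypothetical reply: the matching elements, not links to whole documents -->
<results>
  <fragment source="http://example.org/econ/weimar.xml">
    <p>Hyperinflation reached its peak in Germany in late 1923 ...</p>
  </fragment>
  <fragment source="http://example.org/econ/glossary.xml">
    <definition about="hyperinflation">A very rapid rise in prices ...</definition>
  </fragment>
</results>
```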

Most of the time we choose scenario (a) instead of scenario (b) because we
do not know how the knowledge is encoded. Therefore, the engine returns the
whole document and lets the receiver extract the right fragment or
information from it: for instance, just the paragraph saying that the
hyperinflation phenomenon appeared in Germany after World War One and
was characterized by ...etc. This text fragment may have been tagged with
<definition about="hyperinflation">. This leads us to a third indexing
scenario.

The hyperinflation string may be present in an attribute's value, and the
search engine is not necessarily aware of this particular vocabulary, so it
does not know that this is a definition tag. With one exception, however:
the search engine may know about some human terms, such as what a
definition is. Hence, by having stored a certain ontology, the search engine
can now determine that a <description> element will probably provide some
description of something. So this last scenario leads us to a new stemming
mechanism: extracting the information from the tags and discovering their
meaning. Because the tag provides meta-information about the content, the
engine can deduce that this document fragment is maybe where the beef is.
So the engine deduces that we have a definition, that this definition seems
to be about hyperinflation, and that even if the word "hyperinflation" is
not in the element's data content, the data content is pertinent.
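A minimal sketch of what such a stored ontology might look like, mapping tag
and attribute names from different vocabularies onto concepts the engine
understands (all of the names here are invented for illustration):

```xml
<!-- Hypothetical tag-name ontology: the engine matches element and
     attribute names against concepts it knows, regardless of which
     document vocabulary they come from -->
<ontology>
  <concept name="definition">
    <tag-name>definition</tag-name>
    <tag-name>description</tag-name>
    <tag-name>gloss</tag-name>
  </concept>
  <concept name="topic">
    <attribute-name>about</attribute-name>
    <attribute-name>subject</attribute-name>
  </concept>
</ontology>
```

With such a table, finding "hyperinflation" in the about attribute of a
<definition> element is enough to infer that the element's content defines
the topic, even when the keyword never appears in the text itself.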

So, what could the possible XML replies from a query engine be?
a) obviously the whole document: like today, the engine just returns the
links. A good guy would return an xlink:extended element for each topic,
since the engine basically returned a set of pointers to documents for each
topic requested. But I have serious doubts that the Yahoos and Excites of
this world would do this. Where are they gonna put all these ads when they
reply with an XML document containing only the xlink:extended element? Naa,
they won't do it; this is against their business model. So probably new guys
who invent a new business model will do that.
b) a document fragment, here with two variants. The first one is again an
xlink:extended element for each topic requested and, inside each locator, a
text fragment containing either a summary of the document or the text
fragment where the engine "thinks" the topic is mentioned. In the second,
the engine found the topic in one of the indexed document's tags and made
the inference that the data content is about the topic requested.
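The first variant of reply (b) could be sketched like this (the summary
element name, URL, and summary text are all hypothetical):

```xml
<!-- Hypothetical reply (b): each locator carries a fragment or summary inline -->
<results xmlns:xlink="http://www.w3.org/1999/xlink"
         xlink:type="extended">
  <locator xlink:type="locator"
           xlink:href="http://example.org/econ/weimar.xml">
    <summary>Discusses the hyperinflation that followed World War One
             in Germany and its social consequences.</summary>
  </locator>
</results>
```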

In all cases, and because XML does not resolve the Tower of Babel (in fact,
it just exacerbates it), the engine has to use some intelligence and make
deductions, not by stemming keywords from a sentence but by matching these
keywords against a known ontology and then deducing from it what the
information unit is all about. Obviously, the engine can also store very
popular document vocabularies and their meanings to do the same job. But I
guess that in the beginning we'll see some diversity, or more simply that
we'll only move from HTML to XHTML (a very probable scenario for knowledge
encoding). In this last case, a <p> element still does not provide enough
information about its content, and manually tagging everything is too costly
for knowledge encoding. So maybe if an nth-generation XML authoring tool
provides concept recognition and automatic tagging, then we'll have better
marked-up documents.

Didier PH Martin
Email: martind@netfolder.com
Conferences: Web Chicago(http://www.mfweb.com)
             XML Europe (http://www.gca.org)
Book: XML Professional (http://www.wrox.com)
column: Style Matters (http://www.xml.com)
Products: http://www.netfolder.com

This is xml-dev, the mailing list for XML developers.
To unsubscribe, mailto:majordomo@xml.org&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/

