xml-dev - Re: xml search engine?

From: Peter Murray-Rust <peter@ursus.demon.co.uk>
To: xml-dev@XML.ORG
Date: Wed, 29 Mar 2000 10:46:54 +0100
This is an area I am very involved in, and I hope that the following answer
is of some use.

At 11:43 AM 3/29/00 +0200, Reinout van Rees wrote:
>On Tue, 28 Mar 2000, Jean Marc VANEL wrote:
>
>
>There is a problem I see for xml search engines. How are they going to
>cope with all the various DTD's? They ARE going to cope, but what will
>be the result? Will we have lots of small search engines searching for
>information in all reinforced_concrete_supplier.dtd xml files it can
>find and another for all medicine.dtd info? Will there be a few
>standard elements in most DTD's to comply to some emerging behaviour
>of all search engines? There are so many ways this could work out. Any
>opinions? 

We have sometimes discussed on XML-DEV whether it is possible to have
schemas that are completely machine-interpretable. [I use this to
mean: "If my machine gets a *.xml + *.xsd from some other machine and there
is no prior agreement, can my machine do something useful with the *.xml
(other than print it out for a human to read)?"] The general consensus was
that this was an ultimate goal but probably beyond most XML-ers' immediate
vision. Therefore there has to be some prior agreement about semantics and
ontology.

I am intimately involved with "medicine.dtd" on two fronts. [I don't suggest
the discussion wanders into the details - I use it as an example]. I have
been compiling XML versions of drugs and diseases in conjunction with
expert centres in the field. There is no universal "medicine.dtd" and there
is unlikely to be one. It is more likely that there will be several
approaches, including HL7, MEDLINE, the UMLS Metathesaurus and others. These
will probably all evolve to have an XML interface. Searching them will
depend largely on how the systems are currently deployed, and users will
need to know the details of the organisation of each. [XML isn't magic; it
can be a useful wrapper for existing approaches]. In general these resources
consist of human-readable information and the search engine will have to
know how this is organised.


I have also developed CML (Chemical Markup Language), which is now starting
to become a standard. I am working on making its semantics portable,
especially through a Java-based CML-DOM. The attraction of this is that it
formalises the semantics in a non-arbitrary way - no-one can argue that a
DOM is a non-standard approach. IOW, once the DTD for a technical
discipline has been developed, implementing a DOM is IMO almost mandatory.
It is also extremely good discipline because it makes it clear that every
element in the DTD and every attribute may have to have some code written.
Because of the labour of doing this I would hope that people collaborate on
a communal DOM (mine will be OpenSource) so that we do not get mutant
versions.

The DOM necessarily defines the semantics and sometimes hardcodes ontology
(through behaviour). In this way we move towards a standard way of doing
things in a discipline. This may not be the "best" way, but it is likely to
fly. Therefore I would expect most chemical semantics to depend on the DOM.
It may be that the DOM exposes a "search" interface, and I hope it does or
will. (DOM3 people? Aren't we discussing this at present? :-)
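
To make the idea of a discipline-specific DOM with a "search" interface
concrete, here is a minimal sketch. It is written in Python for brevity, not
in the Java of the CML-DOM, and it is not the CML-DOM API: the class names
Molecule and ChemDocument, the search() method, and the simplified element
and attribute names (molecule, atom, elementType) are all assumptions made
for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


class Molecule:
    """Hypothetical domain object wrapping one <molecule> element."""

    def __init__(self, element):
        self.element = element

    @property
    def id(self):
        return self.element.get("id")

    def element_counts(self):
        # Behaviour lives in code: "how many atoms of each element?"
        return Counter(atom.get("elementType") for atom in self.element.iter("atom"))


class ChemDocument:
    """Hypothetical document object exposing a small search interface."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def molecules(self):
        return [Molecule(el) for el in self.root.iter("molecule")]

    def search(self, predicate):
        # predicate: any callable taking a Molecule and returning a bool
        return [m for m in self.molecules() if predicate(m)]


# Simplified sample markup (assumed, not taken from the CML DTD).
WATER = '<molecule id="water">' + '<atom elementType="H"/>' * 2 + '<atom elementType="O"/></molecule>'
METHANE = '<molecule id="methane"><atom elementType="C"/>' + '<atom elementType="H"/>' * 4 + '</molecule>'
doc = ChemDocument("<doc>" + WATER + METHANE + "</doc>")

with_carbon = doc.search(lambda m: m.element_counts()["C"] > 0)
print([m.id for m in with_carbon])  # prints ['methane']
```

The point of the sketch is only that the chemistry-specific behaviour
(element_counts) sits in the DOM, so a caller searches in terms of the
discipline and not in terms of raw elements and attributes.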

This means that CML will become a component wherever standard chemistry is
involved. CML is an 80/20 solution to chemistry. It has been submitted to
the governing body of chemistry (IUPAC) and will hopefully be used in a wide
range of documents. I envisage at least patents, safety, drugs,
publications, bioinformatics, medicine and materials.

I will be able to ask a question like:

"does this document contain any elements in a namespace mapped onto the URI
http://www.xml-cml.org?"

If so, they can only be *valid* if they conform to the CML DTD. In that
case we could pose a query (in XQL-like syntax):

"find all molecules with more than 20 carbon atoms:"

//molecule/atom/builtin[@elementType='C'][position()=21]

This is incredibly powerful. If, however, we want even more power we could
write extension functions. Of course the community has to know what these
are. They could find all molecules with aromatic rings, or all those with
electrons calculated to have particular energies (this could be done on the
fly as part of the search!).
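
The two questions above (is anything in the CML namespace present, and which
molecules have more than 20 carbon atoms) can be sketched with Python's
standard xml.etree.ElementTree. This is an illustration under stated
assumptions, not CML tooling: the sample document, the cml prefix, and the
simplified markup in which each atom carries an elementType attribute are
invented for the example. The query in the text tests for the existence of a
21st carbon; the sketch counts the carbon atoms instead.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org"  # namespace URI quoted in the text


def molecule_xml(mol_id, carbons, hydrogens):
    # Build a simplified, hypothetical molecule; not the real CML DTD.
    atoms = ['<cml:atom elementType="C"/>'] * carbons
    atoms += ['<cml:atom elementType="H"/>'] * hydrogens
    return '<cml:molecule id="%s">%s</cml:molecule>' % (mol_id, "".join(atoms))


SAMPLE = '<report xmlns:cml="%s"><title>Hydrocarbons</title>%s%s</report>' % (
    CML_NS,
    molecule_xml("methane", 1, 4),
    molecule_xml("docosane", 22, 46),
)


def contains_cml(root):
    # ElementTree stores namespaced tags as "{uri}localname".
    prefix = "{%s}" % CML_NS
    return any(isinstance(el.tag, str) and el.tag.startswith(prefix)
               for el in root.iter())


def molecules_with_more_carbons_than(root, threshold):
    ns = {"cml": CML_NS}
    hits = []
    for mol in root.iterfind(".//cml:molecule", ns):
        carbons = mol.findall("cml:atom[@elementType='C']", ns)
        if len(carbons) > threshold:
            hits.append(mol.get("id"))
    return hits


root = ET.fromstring(SAMPLE)
print(contains_cml(root))                          # prints True
print(molecules_with_more_carbons_than(root, 20))  # prints ['docosane']
```

An extension function of the kind mentioned above would slot in where the
carbon count is tested: any callable that takes a molecule element and
returns true or false could replace that condition.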

I have the honour of having been asked to help with the construction of the
Materials Markup Language (MatML), run by Ed Begley at NIST. This may well
involve concrete. In any case we are tackling exactly these questions -
creating a *simple* approach, linked to terminologies and interoperable
with several other MLs.

	P.



