   SAX, DOM, and Search Engines (was Re: xml parser)

  • From: <david@megginson.com>
  • To: <xml-dev@ic.ac.uk>
  • Date: Wed, 4 Nov 1998 17:32:41 -0500 (EST)

Tim Bray writes:

 > At 10:55 AM 11/4/98 -0000, Michael Kay wrote:
 > >My immediate answer to this is yes, all the information you need for a
 > >search engine is available via the SAX or DOM interface offered by many
 > >parsers.
 > I disagree.  Few parsers track byte offsets or other locational info in
 > the file, and I think you need that to do basic things like proximity
 > and phrase search.

I disagree.  While byte offsets might be useful for other purposes,
they would be inappropriate for proximity and phrase searches -- for
those, you need to track the relative positions of words, not their
absolute positions.  Consider the following example:

  <p>WORD1 &x; WORD2</p>

Is WORD1 close to WORD2?  It's only five bytes away (assuming an 8-bit
encoding), but might be separated by 20,000 words, depending on what
&x; expands to.  SAX and the DOM do give you enough information to
determine the relative positions of words.
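
As a rough illustration of the point above, here is a minimal sketch using Python's standard xml.sax module. The entity declaration, the handler class, and the word_positions helper are all made up for the example; the only thing it relies on is that the parser delivers the expanded text of &x; through the ordinary characters() callback. Text is buffered until the end of the document because a SAX parser may deliver character data in several chunks.

```python
import xml.sax


class WordPositionHandler(xml.sax.ContentHandler):
    """Collect character data and record the ordinal position of each word."""

    def __init__(self):
        super().__init__()
        self._chunks = []    # character data in document order
        self.positions = {}  # word -> list of 0-based word positions

    def characters(self, content):
        # Entity replacement text arrives here like any other text.
        # A parser may split text into several calls, so only buffer it.
        self._chunks.append(content)

    def endDocument(self):
        # Simplification: words are whitespace-separated, and element
        # boundaries are not treated as word breaks.
        words = "".join(self._chunks).split()
        for index, word in enumerate(words):
            self.positions.setdefault(word, []).append(index)


def word_positions(xml_text):
    """Hypothetical helper: parse a document and return word positions."""
    handler = WordPositionHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.positions


# Assumed entity declaration, added only so the example is self-contained.
DOC = '<!DOCTYPE p [<!ENTITY x "alpha beta gamma">]><p>WORD1 &x; WORD2</p>'

positions = word_positions(DOC)
# Distance in words, which reflects whatever &x; expands to.
distance = positions["WORD2"][0] - positions["WORD1"][0]
```

The distance computed this way grows with the replacement text of &x;, while the byte offsets of WORD1 and WORD2 in the source file stay the same.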

Byte offsets would be helpful for displaying context around a match,
but there would be no 100% reliable way to format that context without
starting from the top of the document.  In that case an XPointer (also
derivable from SAX or DOM) might be more helpful, unless you want the
search engine to display raw XML markup for the context.
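
One way a pointer might be derived from SAX events is sketched below, again with Python's xml.sax. This is only an assumption about what such a pointer could look like: it emits a child sequence in the style of the later XPointer element() scheme (for example element(/1/2/1)), and the handler class and sample document are invented for the example. It reports the element that directly contains the matching word.

```python
import xml.sax


class ChildSequenceHandler(xml.sax.ContentHandler):
    """Record a child-sequence pointer for each element whose own text
    contains the target word (illustrative only)."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self._counts = [0]  # child elements seen so far at each open level
        self._path = []     # 1-based child index of each open element
        self._buf = []      # text seen since the last start/end tag
        self.matches = []

    def _flush(self):
        # Check buffered text against the element that is currently open.
        if self.target in "".join(self._buf).split():
            pointer = "element(/" + "/".join(str(i) for i in self._path) + ")"
            if pointer not in self.matches:
                self.matches.append(pointer)
        self._buf = []

    def startElement(self, name, attrs):
        self._flush()
        self._counts[-1] += 1
        self._path.append(self._counts[-1])
        self._counts.append(0)

    def endElement(self, name):
        self._flush()
        self._counts.pop()
        self._path.pop()

    def characters(self, content):
        self._buf.append(content)


# Hypothetical sample document.
SAMPLE = "<doc><p>one two</p><p>three <b>WORD2</b> four</p></doc>"

handler = ChildSequenceHandler("WORD2")
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
matches = handler.matches
```

A search engine could store such a pointer alongside each word position and use it later to fetch and format the surrounding element, rather than cutting raw markup out of the file at a byte offset.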

All the best,


David Megginson                 david@megginson.com

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)

