OASIS Mailing List Archives

Re: Searching XML

Phil Ruelle wrote:

> Each document will be an order containing a list of items. The most 
> frequent type of search would be to select orders that:
> a) contain item X
> b) contain item X and item Y
> c) contain item X but do not contain item Y

Given that searches are this simple (as I had suspected), I
would recommend bypassing XML parsing altogether and just doing
plain string searches.  You will have a huge gain in speed,
space, and simplicity.
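
As a rough illustration of the three searches quoted above, here is a
minimal Java sketch that treats each order as one string and never
parses it. The <item> markup, the class name OrderSearch, and the
method names are assumptions made up for this example; the real order
format was not posted. It also assumes each item is written exactly as
<item>NAME</item> with no attributes or extra whitespace.

```java
// Sketch only: the <item> markup, class name, and method names are
// assumptions for illustration; adjust the pattern to the real order format.
final class OrderSearch {

    // Builds the exact text that must appear for an item to count as present.
    private static String itemPattern(String item) {
        return "<item>" + item + "</item>";
    }

    // a) the order contains item x
    static boolean containsItem(String orderXml, String x) {
        return orderXml.contains(itemPattern(x));
    }

    // b) the order contains item x and item y
    static boolean containsBoth(String orderXml, String x, String y) {
        return containsItem(orderXml, x) && containsItem(orderXml, y);
    }

    // c) the order contains item x but does not contain item y
    static boolean containsButNot(String orderXml, String x, String y) {
        return containsItem(orderXml, x) && !containsItem(orderXml, y);
    }

    public static void main(String[] args) {
        String order = "<order><item>widget</item><item>gadget</item></order>";
        System.out.println(containsItem(order, "widget"));             // true
        System.out.println(containsBoth(order, "widget", "gadget"));   // true
        System.out.println(containsButNot(order, "widget", "gadget")); // false
    }
}
```

Matching on the full <item>...</item> text rather than the bare name
keeps "widget" from matching inside "widgets" or inside some other
element.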

As I have posted here before, XML is a document format, not a
processing model.

> I am intrigued by your question about search complexity - I hadn't 
> really considered its effect. Could you point me to some references 
> that explain what the options are and how they alter according to 
> complexity or could you possibly expand a little on this yourself?

If you want searches that involve context ("Find all foo elements with content
bar that are inside baz elements") then you pretty much need an XPath
implementation.  But if you don't care about context, the aforementioned
string search is very reasonable.
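
For comparison, the context-dependent query in the example above might
look like the following sketch, which uses the javax.xml.xpath API
that ships with current Java releases. The class and method names are
hypothetical, and the XPath expression //baz//foo[. = 'bar'] is one
reading of the query (a foo at any depth inside a baz, whose text is
exactly "bar").

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch only: class and method names are made up for this example.
final class ContextSearch {

    // Counts foo elements with text "bar" that sit anywhere inside a baz element.
    static int countFooBarInsideBaz(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // This parses the whole document, which is the cost that the
            // plain string search avoids.
            NodeList hits = (NodeList) xpath.evaluate(
                    "//baz//foo[. = 'bar']",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate query", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><baz><foo>bar</foo></baz><foo>bar</foo></root>";
        // Only the first foo is inside a baz, so this prints 1.
        System.out.println(countFooBarInsideBaz(xml));
    }
}
```

A string search for "<foo>bar</foo>" would report both foo elements
here, because it cannot tell which one is inside a baz.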

We process a lot of XML documents here by comparing them to about 100
files containing about 200-5000 search terms each.  The simplest
approach is just the string search, and it saves us from having to
parse a single one of those documents.
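
This is not our actual code, but the idea can be sketched in a few
lines of Java. The sketch assumes each term file holds one search term
per line; the class name TermMatcher and its methods are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch only: assumes one search term per line in each term file.
final class TermMatcher {

    // Reads a term file, trimming whitespace and skipping blank lines.
    static List<String> loadTerms(Path termFile) throws IOException {
        List<String> terms = new ArrayList<>();
        for (String line : Files.readAllLines(termFile)) {
            String term = line.trim();
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Returns the terms that occur somewhere in the raw document text.
    // The document is treated as a plain string and is never parsed.
    static List<String> matchingTerms(String documentText, List<String> terms) {
        List<String> found = new ArrayList<>();
        for (String term : terms) {
            if (!term.isEmpty() && documentText.contains(term)) {
                found.add(term);
            }
        }
        return found;
    }
}
```

A caller would run loadTerms once per term file and then call
matchingTerms for each document. This checks one term at a time, which
is the simplest thing that could work; a multi-pattern matcher would
scale better for the larger term lists.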

>>Also, what operating environment are you using?
> I'm using a desktop with Windows 98 and NT on it. Also I intend to 
> program in Java if only due to the vast amount of XML code 
> available.

Grab yourself an implementation of "fgrep" for your platform,
and don't program anything at all!

There is / one art             || John Cowan <jcowan@reutershealth.com>
no more / no less              || http://www.reutershealth.com
to do / all things             || http://www.ccil.org/~cowan
with art- / lessness           \\ -- Piet Hein