   Re: [xml-dev] indexing and querying XML (not XQuery)

At 09:25 -0400 2005-08-23, Alan Gutierrez wrote:
>* Robert Koberg <rob@koberg.com> [2005-08-23 09:06]:
>>  Hi,
>>  Someone on the Lucene user's list posted a link to this paper:
>>  that talks about indexing and searching XML documents. I have been doing
>
>     Reading through the article, the thing that strikes me is that
>     full text search of an XML document depends so much on
>     the structure of the document. If that document can be divided
>     into chapters, messages, articles, pages, etc, then it's best to
>     create a full-text index with application specific documents.

I'm not quite sure what you mean by "depends so much on the structure 
of the document". Certainly if you want to do searching that makes 
use of the markup, that depends on the markup. But it seems you may 
mean something stronger: that search is too tied to the details of a 
particular schema, or that it may be impractical to make a generic 
search engine. If so, I disagree. There have been search 
implementations that do a good job with generic XML.

I'm also puzzled by what you mean by "application specific 
documents", and the part about "dividing" documents up. There are 
many information management solutions that sadly force you to "chunk" 
your information at a single level -- for example, a client of mine 
asked me to sit in on meetings with another consulting firm, that 
they had hired to index a lot of their XML information -- which was 
organized hierarchically (as some reasonable % of XML data is, after 
all). They had just run into the snag that the system they were using 
(which I'll leave unnamed) could not really operate that way 
(marketing literature notwithstanding). They were forced to pick one 
single level (chapter, paragraph, section, or some such), and *only* 
at that level could you:

    * checkin/checkout
    * search for co-occurrence or proximity of terms ("and", "near")

If two things ended up in separate "chunks", as far as searching knew 
they were in separate unrelated documents. If you wanted to be able 
to sometimes search for terms co-occurring in the same paragraph, and 
other times in the same section, forget it. Also, the cost to 
reconstruct a whole document from its "chunks" was high -- though you 
had to do that every time you wanted a whole document to export, or 
print, or validate, or....
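
A small sketch of the single-level problem, in Python with the standard 
xml.etree.ElementTree module. The sample document, the element names, 
and the helpers chunk_index and both_in_one_chunk are all made up for 
illustration; this is not how any particular product works, only the 
shape of the limitation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are illustrative only.
DOC = """<book>
  <section>
    <para>Lucene builds an inverted index.</para>
    <para>XML markup carries hierarchy.</para>
  </section>
</book>"""

def chunk_index(root, level):
    """Flatten the document into one word set per `level` element.

    Nothing about parents or siblings survives the flattening.
    """
    return [
        {w.strip(".,").lower() for w in " ".join(e.itertext()).split()}
        for e in root.iter(level)
    ]

def both_in_one_chunk(chunks, a, b):
    """True if some single chunk contains both terms."""
    return any(a in c and b in c for c in chunks)

root = ET.fromstring(DOC)

# Index chunked at paragraph level: the two terms land in different chunks.
para_hit = both_in_one_chunk(chunk_index(root, "para"), "lucene", "markup")

# The same question at section level needs a second, separately built index.
section_hit = both_in_one_chunk(chunk_index(root, "section"), "lucene", "markup")
```

Here para_hit is False and section_hit is True. The paragraph-level 
index cannot answer the section-level question at all; the only way to 
get it is to re-chunk and re-index everything at the other level.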

The other consulting firm was over a barrel because they had written 
complicated "chunking" code to break the XML into the required 
chunks, and schema revisions meant they had to rewrite all of that. 
They really were trying hard (I was nice to them -- they were clearly 
sweating a lot and realized the problem they had stuck themselves 
with); the indexing tool they chose hamstrung them in a lot of ways 
that were very hard to see at the start, but very painful once seen.

I mention this not because that system is unusual but because it 
*isn't*. There are *many* indexing systems with just this kind of 
behavior: They deal with exactly *one* level of structure. The 
situation is really even worse. Think through *all* of the schema 
you're dealing with. Are there footnotes, revision markup, 
effectivity, hyperlinks...? Most schemas pose at least a few really 
nasty problems for "chunk-style" indexing.

>     So, perhaps, the scalable solution is a full-text engine that
>     is fed XML documents and a full-text indexing schema.
>     The existing schema languages like to atomize documents, while a
>     full-text indexing schema might group their elements into
>     concepts, like paths, links, articles, and clues for ranking
>     articles based on conditions specified in XPath.

This is an interesting notion. Do you mean that existing *XML* schema 
languages like to atomize, or that existing *indexer* "schemas" do? 
It sounds like you're saying that XML schemas do, which seems to me 
incorrect in the sense of "atomize" that matters here. XML schemas 
give you not only "atoms," but a huge variety of complex "molecules" 
and other structures. Many indexers, OTOH, *really* atomize: to the 
extent they only deal with one kind of structure, despite the 
diversity of reality.

As it is, most indexers *do* have an "indexing schema," though they 
don't call it that, and it's hard-wired/unchangeable. It's commonly 
fairly pathetic:

      document ::= chunk+

It seems to me the problem isn't at the XML schema end. If our data 
was structured the way many indexers *want* it to be, we could 
trivially write XML schemas for that and trivially transform our 
documents into it. But if you really do that, there isn't much 
structural information left in your documents: and therefore the 
indexers can't use it to advantage. We did put all that markup in 
there for a reason, didn't we? I hope....

Indexing systems that took the actual XML schema seriously might do 
all you need. Are there things an ideal "indexing schema" would 
include that are not in the XML schema already? If so, that's a *very* 
interesting topic to pursue, I think. And if so, which of those 
things really *should* be in the XML to start with? Are they *really* 
only useful for indexing? I rather doubt it. I contend that like 
formatting information, indexing information should be derivable 
*from* the XML markup. If an "indexing schema" isn't simply derivable 
by rule from the existing markup, then the information isn't in the 
input, right? Or at least, isn't explicit, which is what counts for 

>     I've wanted to explore the use of Lucene in my document object
>     model, so I'd like to hear more about this.

There are many indexing solutions out there, many of them quite good 
for what they do. I looked at Lucene a long time ago and it seemed 
pretty nice overall, though I've lost track of the details by now. If 
I remember right, it did have to break things down pretty finely, 
though it could do some kinds of searches across the chunks. That 
approach tends to cause problems when searches get very complicated. For 
example if you want to find X anywhere within elements of type T, you 
may have to do a big OR to account for all the things that might be 
in between: X in T or X in EMPH in T or X in P in T or X in EMPH in P 
in T or X in fox in socks in T.... Otherwise you simply miss all the 
cases you didn't mention. Or, there might be a single user search 
command that does that easily for the user, but expands to the gory 
"or" inside and gets real slow. Just be really careful in evaluating 
whatever engines you look at.
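
To make the "big OR" concrete, here is a rough Python sketch using the 
standard xml.etree.ElementTree module. The sample document, the NOTE 
element, and the function term_paths are invented for the example. It 
compares an enumerated list of nestings against a plain "T is 
somewhere above" test.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the tag names follow the T / P / EMPH example above.
DOC = """<doc>
  <T>fox</T>
  <T><P>fox</P></T>
  <T><P><EMPH>fox</EMPH></P></T>
  <T><NOTE><EMPH>fox</EMPH></NOTE></T>
  <P>fox</P>
</doc>"""

def term_paths(elem, term, path=()):
    """Yield the tag path (root first) of each element whose leading
    text contains `term`. Tail text is ignored to keep the sketch short."""
    path = path + (elem.tag,)
    if elem.text and term in elem.text:
        yield path
    for child in elem:
        yield from term_paths(child, term, path)

root = ET.fromstring(DOC)
paths = list(term_paths(root, "fox"))

# The "big OR": only the nestings somebody remembered to list.
enumerated = [("T",), ("T", "EMPH"), ("T", "P"), ("T", "P", "EMPH")]
enumerated_hits = [
    p for p in paths if any(p[-len(q):] == q for q in enumerated)
]

# Structure-aware test: T appears anywhere among the ancestors (or self).
ancestor_hits = [p for p in paths if "T" in p]
```

The enumerated form finds three of the four occurrences inside T and 
silently misses the one under NOTE; the ancestor test finds all four 
and still excludes the P that is not inside any T.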

It's not extremely hard to build a completely structure-aware indexer 
(though optimizing them for really huge document collections is 
harder). But they're still not common, and indexers that weren't 
built specifically for XML from the beginning, often have many 
surprises awaiting the unwary.
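
As a hedged illustration of what "structure-aware" can mean, here is a 
minimal Python sketch (standard library only). It records a word 
position for every token and a (start, end) position interval for 
every element, so containment at any level becomes an interval test 
against one index. The names build_index and cooccur_in and the sample 
document are assumptions for the example, and nothing here addresses 
the optimization needed for large collections.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(root):
    """Return (postings, regions).

    postings: word -> list of word positions
    regions:  tag  -> list of (start, end) intervals, end exclusive
    """
    postings = defaultdict(list)
    regions = defaultdict(list)
    pos = 0

    def add_text(text):
        nonlocal pos
        for w in (text or "").split():
            postings[w.strip(".,;:").lower()].append(pos)
            pos += 1

    def walk(elem):
        start = pos
        add_text(elem.text)
        for child in elem:
            walk(child)
            add_text(child.tail)
        regions[elem.tag].append((start, pos))

    walk(root)
    return postings, regions

def cooccur_in(postings, regions, tag, a, b):
    """Intervals of `tag` elements that contain both words."""
    return [
        (s, e) for s, e in regions[tag]
        if any(s <= p < e for p in postings.get(a, []))
        and any(s <= p < e for p in postings.get(b, []))
    ]

# Hypothetical sample document.
root = ET.fromstring(
    "<book><section>"
    "<para>Lucene builds an index.</para>"
    "<para>Markup carries hierarchy.</para>"
    "</section></book>"
)
postings, regions = build_index(root)
in_para = cooccur_in(postings, regions, "para", "lucene", "markup")
in_section = cooccur_in(postings, regions, "section", "lucene", "markup")
```

With this sample, in_para is empty and in_section holds the one 
section interval, both answered from the same index with no 
re-chunking.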

Best wishes,


PS: The "chunking" or single-level issue was a hot topic in hypertext 
and information retrieval articles in the late 80's and early 90's, 
and much of what was written then is, perhaps surprisingly, still 
timely today. If inclined, check out the Proceedings of the yearly 
ACM "Hypertext" conferences. Also some of this was discussed during 
the W3C  "QL98" conference in Cambridge (that kicked off the W3C work 
on querying), available at http://www.w3.org/TandS/QL/QL98/. Of 
course the first thing you'll want to read from there is my paper... 
;) http://www.w3.org/TandS/QL/QL98/pp/linkhier.html  And as always, 
the Cover Pages have a wealth of good info, for example at 

Luthien Consulting: Real solutions to hard information management problems
    Specializing in XML, schema design, XSLT, and project design/review/repair
Steven J. DeRose, Ph.D., sderose@acm.org


