At 09:25 -0400 2005-08-23, Alan Gutierrez wrote:
> * Robert Koberg <rob@koberg.com> [2005-08-23 09:06]:
>> Hi,
>>
>> Someone on the Lucene user's list posted a link to this paper:
>>
>> http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html
>>
>> that talks about indexing and searching XML documents. I have been doing
> ...
>
> Reading through the article, the thing that strikes me is that
> full-text search of an XML document depends so much on the
> structure of the document. If that document can be divided into
> chapters, messages, articles, pages, etc., then it's best to
> create a full-text index with application specific documents.

I'm not quite sure what you mean by "depends so much on the structure
of the document". Certainly if you want to do searching that makes
use of the markup, that depends on the markup. But it seems like you
may be thinking something stronger: that search is so tied to the
details of a particular schema, or that it may be impractical to make
a generic search engine. If so, I disagree. There have been search
implementations that do a good job with generic XML.

I'm also puzzled by what you mean by "application specific
documents", and the part about "dividing" documents up. There are
many information management solutions that sadly force you to "chunk"
your information at a single level -- for example, a client of mine
asked me to sit in on meetings with another consulting firm that
they had hired to index a lot of their XML information -- which was
organized hierarchically (as some reasonable % of XML data is, after
all). They had just run into the snag that the system they were using
(which I'll leave unnamed) could not really operate that way
(marketing literature notwithstanding). They were forced to pick one
single level (chapter, paragraph, section, or some such), and *only*
at that level could you:
* checkin/checkout
* search for co-occurrence or proximity of terms ("and", "near")
...etc...

If two things ended up in separate "chunks", as far as searching knew
they were in separate unrelated documents. If you wanted to be able
to sometimes search for terms co-occurring in the same paragraph, and
other times in the same section, forget it. Also, the cost to
reconstruct a whole document from its "chunks" was high -- though you
had to do that every time you wanted a whole document to export, or
print, or validate, or....

The other consulting firm was over a barrel because they had written
complicated "chunking" code to break the XML into the required
chunks, and schema revisions meant they had to rewrite all of that.
They really were trying hard (I was nice to them -- they were clearly
sweating a lot and realized the problem they had stuck themselves
with); the indexing tool they chose hamstrung them in a lot of ways
that were very hard to see at the start, but very painful once seen.

I mention this not because that system is unusual but because it
*isn't*. There are *many* indexing systems with just this kind of
behavior: They deal with exactly *one* level of structure. The
situation is really even worse. Think through *all* of the schema
you're dealing with. Are there footnotes, revision markup,
effectivity, hyperlinks...? Most schemas pose at least a few really
nasty problems for "chunk-style" indexing.

> So, perhaps, the scalable solution is a full-text engine that
> is fed XML documents, and a full-text indexing schema.
>
> The existing schema languages like to atomize documents, while a
> full-text indexing schema might group their elements into
> concepts, like paths, links, articles, and clues for ranking
> articles based on conditions specified in XPath.

This is an interesting notion. Do you mean that existing *XML* schema
languages like to atomize, or that existing *indexer* "schemas" do?
It sounds like you're saying that XML schemas do, which seems to me
incorrect in the sense of "atomize" that matters here. XML schemas
give you not only "atoms," but a huge variety of complex "molecules"
and other structures. Many indexers, OTOH, *really* atomize, to the
extent that they deal with only one kind of structure, despite the
diversity of reality.

As it is, most indexers *do* have an "indexing schema," though they
don't call it that, and it's hard-wired/unchangeable. It's commonly
fairly pathetic:

    document ::= chunk+

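If it helps to see what that amounts to in practice, here is roughly
that model spelled out as code -- a sketch in Java, with class and
field names I've made up, not taken from any particular product:

    import java.util.List;

    // The implicit "indexing schema" of many engines, made explicit:
    // a document is nothing but a flat list of text chunks. Element
    // types, nesting, attributes, and links have nowhere to live.
    class Chunk {
        String id;    // e.g. "chunk-0412"
        String text;  // flattened character content, markup stripped
    }

    class IndexableDocument {
        String id;
        List<Chunk> chunks;   // i.e. document ::= chunk+
    }

Notice that nothing in that model can record which chunk contains
which, or even what kind of thing a given chunk is.
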
It seems to me the problem isn't at the XML schema end. If our data
was structured the way many indexers *want* it to be, we could
trivially write XML schemas for that and trivially transform our
documents into it. But if you really do that, there isn't much
structural information left in your documents, and therefore the
indexers can't use it to advantage. We did put all that markup in
there for a reason, didn't we? I hope....

Indexing systems that took the actual XML schema seriously might do
all you need. Are there things an ideal "indexing schema" would
include that aren't in the XML schema already? If so, that's a *very*
interesting topic to pursue, I think. And if so, which of those
things really *should* be in the XML to start with? Are they *really*
only useful for indexing? I rather doubt it. I contend that, like
formatting information, indexing information should be derivable
*from* the XML markup. If an "indexing schema" isn't simply derivable
by rule from the existing markup, then the information isn't in the
input, right? Or at least, it isn't explicit, which is what counts
for processing.
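
As a thought experiment, here is the kind of derivation I mean,
sketched in Java with SAX. The element names and the rule set are
invented for illustration; a real version would read its rules from
a stylesheet-like document rather than hard-coding them:

    import java.util.*;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // One derived index record: a unit of retrieval plus its text.
    class IndexRecord {
        String unitType;                        // e.g. "section"
        StringBuilder text = new StringBuilder();
    }

    // Derive index records from the markup by rule, much as a
    // stylesheet derives formatting. The "indexing schema" here is
    // just the RULES set; nothing is asked of the XML that is not
    // already in it.
    class RuleBasedIndexer extends DefaultHandler {
        // hypothetical rule: these element types are units of retrieval
        static final Set<String> RULES =
            new HashSet<String>(Arrays.asList("chapter", "section", "para"));

        private final Deque<IndexRecord> open = new ArrayDeque<IndexRecord>();
        final List<IndexRecord> records = new ArrayList<IndexRecord>();

        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            if (RULES.contains(qName)) {
                IndexRecord r = new IndexRecord();
                r.unitType = qName;
                open.push(r);
            }
        }

        public void characters(char[] ch, int start, int len) {
            // text belongs to *every* open unit, not just one level
            for (IndexRecord r : open) r.text.append(ch, start, len);
        }

        public void endElement(String uri, String local, String qName) {
            if (RULES.contains(qName)) records.add(open.pop());
        }

        public static void main(String[] args) throws Exception {
            RuleBasedIndexer h = new RuleBasedIndexer();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new java.io.File(args[0]), h);
            for (IndexRecord r : h.records)
                System.out.println(r.unitType + ": "
                    + r.text.length() + " chars");
        }
    }

The point being: everything the index needs is derived from markup
that is already there. Change the rules and you change the index,
without ever touching the documents.
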
>
> I've wanted to explore the use of Lucene in my document object
> model, so I'd like to hear more about this.

There are many indexing solutions out there, many of them quite good
for what they do. I looked at Lucene a long time ago and it seemed
pretty nice overall, though I've lost track of the details by now. If
I remember right, it did have to break things down pretty finely,
though it could do some kinds of searches across the chunks. That
approach tends to run into problems when searches get complicated. For
example, if you want to find X anywhere within elements of type T, you
may have to do a big OR to account for all the things that might be
in between: X in T or X in EMPH in T or X in P in T or X in EMPH in P
in T or X in fox in socks in T.... Otherwise you simply miss all the
cases you didn't mention. Or, there might be a single user search
command that does that easily for the user, but expands to the gory
"or" inside and gets real slow. Just be really careful in evaluating
whatever engines you look at.

It's not extremely hard to build a completely structure-aware indexer
(though optimizing one for really huge document collections is
harder). But they're still not common, and indexers that weren't
built specifically for XML from the beginning often have many
surprises awaiting the unwary.
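
For whatever it's worth, the core of the structure-aware idea fits on
a page. Here is a rough sketch in Java -- not any shipping engine's
code, and the element type "T", the word "fox", and the file taken
from the command line are all just placeholders: record, for each
word occurrence, the element types open around it, and "X anywhere
within T" becomes a single containment test instead of a hand-built
OR over every possible intermediate path.

    import java.util.*;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    // One word occurrence plus the element types open around it.
    // (A real posting would also carry a document id and position.)
    class Posting {
        final Set<String> ancestors;
        Posting(Set<String> a) { ancestors = a; }
    }

    class StructureAwareIndex extends DefaultHandler {
        private final Deque<String> stack = new ArrayDeque<String>();
        private final Map<String, List<Posting>> postings =
            new HashMap<String, List<Posting>>();

        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            stack.push(qName);
        }

        public void endElement(String uri, String local, String qName) {
            stack.pop();
        }

        public void characters(char[] ch, int start, int len) {
            Set<String> context = new HashSet<String>(stack);
            for (String word : new String(ch, start, len)
                                   .toLowerCase().split("\\W+")) {
                if (word.length() == 0) continue;
                List<Posting> list = postings.get(word);
                if (list == null) {
                    list = new ArrayList<Posting>();
                    postings.put(word, list);
                }
                list.add(new Posting(context));
            }
        }

        // "X anywhere within an element of type T": one containment
        // test per posting, no enumeration of intermediate paths.
        public boolean foundWithin(String word, String elementType) {
            List<Posting> list = postings.get(word);
            if (list == null) return false;
            for (Posting p : list)
                if (p.ancestors.contains(elementType)) return true;
            return false;
        }

        public static void main(String[] args) throws Exception {
            StructureAwareIndex ix = new StructureAwareIndex();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new java.io.File(args[0]), ix);
            // e.g., does "fox" occur anywhere inside a <T> element?
            System.out.println(ix.foundWithin("fox", "T"));
        }
    }

A fuller version would record element identities and word positions
as well, so that co-occurrence "in the same paragraph" and "in the
same section" are both just filters over one index rather than two
different chunking decisions.
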
Best wishes,
Steve

PS: The "chunking" or single-level issue was a hot topic in hypertext
and information retrieval articles in the late 80's and early 90's,
and much of what was written then is, perhaps surprisingly, still
timely today. If inclined, check out the Proceedings of the yearly
ACM "Hypertext" conferences. Also some of this was discussed during
the W3C "QL98" conference in Cambridge (that kicked off the W3C work
on querying), available at http://www.w3.org/TandS/QL/QL98/. Of
course the first thing you'll want to read from there is my paper...
;) http://www.w3.org/TandS/QL/QL98/pp/linkhier.html
And as always, the Cover Pages have a wealth of good info, for
example at http://xml.coverpages.org/xmlQuery.html
--
Luthien Consulting: Real solutions to hard information management problems
Specializing in XML, schema design, XSLT, and project design/review/repair
Steven J. DeRose, Ph.D., sderose@acm.org