* Robert Koberg <rob@koberg.com> [2005-08-23 09:06]:
> Hi,
>
> Someone on the Lucene user's list posted a link to this paper:
> http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html
> that talks about indexing and searching XML documents. I have been doing
> something similar for a while (3 years, I think) but it is specific to
> our configuration/content which probably doesn't have wider
> applicability. I have also found it to be:
> "a fast, reliable XML search engine, which has exceeded our expectations
> in terms of flexibility and low development cost."
> I was thinking the article would be of interest to many people here. I
> was also wondering about your thoughts on this method of dealing with
> XML. I have not looked in depth at XQuery, and I am wondering what
> strengths/benefits XQuery would have over using something like Lucene to
> index/query XML.
> It would be interesting to see what folk from this list would come up
> with if they put their brains to work on ways to handle
> indexing/searching with something like Lucene.
Len was in a thread a while back, on Web 2.0, where I posited
the notion of a REST interface to full text search of syndicated
feeds, or blogs.
While we're at it, Len, did you think about that any further?
Reading through the article, the thing that strikes me is that
full-text search of an XML document depends so much on
the structure of the document. If that document can be divided
into chapters, messages, articles, pages, etc., then it's best to
create a full-text index with application-specific documents.
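To make the "application-specific documents" idea concrete, here is a rough sketch in Python. It uses the standard library's ElementTree and a toy inverted index, not Lucene itself; the sample XML, the `build_index` function, and the choice of `chapter` as the search unit are all assumptions made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: a book whose natural search unit is the chapter.
SAMPLE = """<book>
  <chapter id="c1"><title>Indexing</title><para>Lucene builds an inverted index.</para></chapter>
  <chapter id="c2"><title>Querying</title><para>XQuery walks the tree structure.</para></chapter>
</book>"""

def build_index(xml_text, unit_path):
    """Treat each element matched by unit_path as one searchable document."""
    root = ET.fromstring(xml_text)
    index = defaultdict(set)  # token -> ids of the units containing it
    for unit in root.findall(unit_path):
        text = " ".join(unit.itertext())
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(unit.get("id"))
    return index

index = build_index(SAMPLE, "./chapter")
print(sorted(index["index"]))  # ['c1']
```

A hit comes back as a chapter id rather than as "somewhere in the book", which is the point of picking the unit to match the document's structure.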
So, perhaps, the scalable solution is a full-text engine that
is fed XML documents and a full-text indexing schema.
The existing schema languages like to atomize documents, while a
full-text indexing schema might group their elements into
concepts, like paths, links, articles, and clues for ranking
articles based on conditions specified in XPath.
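As a very loose sketch of what such an indexing schema could look like, here it is as a plain Python dictionary: one path that picks out the searchable unit, plus field paths that each carry a ranking weight. The `SCHEMA` layout, the boost values, the sample feed, and the `search` function are all hypothetical, and ElementTree only understands a limited subset of XPath, so a real engine would need a fuller XPath implementation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical indexing schema: which elements form a searchable unit,
# which child paths become fields, and how much each field counts in ranking.
SCHEMA = {
    "unit": ".//article",
    "fields": {"title": ("./title", 3.0), "body": ("./body", 1.0)},
}

FEED = """<feed>
  <article id="a1"><title>Lucene and XML</title><body>Notes on indexing feeds.</body></article>
  <article id="a2"><title>Feeds</title><body>Lucene can index XML feeds.</body></article>
</feed>"""

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def search(xml_text, schema, term):
    """Score each unit by boosted term counts across its schema fields."""
    root = ET.fromstring(xml_text)
    scores = {}
    for unit in root.findall(schema["unit"]):
        score = 0.0
        for path, boost in schema["fields"].values():
            for node in unit.findall(path):
                score += boost * tokens("".join(node.itertext())).count(term.lower())
        if score > 0:
            scores[unit.get("id")] = score
    # Highest score first.
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search(FEED, SCHEMA, "lucene"))  # [('a1', 3.0), ('a2', 1.0)]
```

The same feed could be re-indexed under a different schema without touching the engine, which is what would make the approach more general than a hand-rolled, per-application indexer.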
I've wanted to explore the use of Lucene in my document object
model, so I'd like to hear more about this.
--
Alan Gutierrez - alan@engrm.com
- http://engrm.com/blogometer/index.html
- http://engrm.com/blogometer/rss.2.0.xml