The nux-1.2 release has been uploaded to
http://dsd.lbl.gov/nux/
Nux is an open-source Java XML toolset geared towards embedded use in
high-throughput XML messaging middleware such as large-scale Peer-to-
Peer infrastructures, message queues, publish-subscribe and
matchmaking systems for Blogs/newsfeeds, text chat, data acquisition
and distribution systems, application level routers, firewalls,
classifiers, etc. It is not an XML database, and does not attempt to
be one.
Changelog:
XQuery/XPath: Added optional fulltext search via the Apache Lucene
engine. Similar to Google search, it is easy to use, powerful,
efficient and goes far beyond what can be done with standard XPath
regular expressions and string manipulation functions. It is similar
in intent to, but not directly related to, the preliminary W3C fulltext
search drafts. Rather than targeting fulltext search of infrequent queries
over huge persistent data archives (historic search), Nux targets
fulltext search of huge numbers of queries over comparatively small
transient realtime data (prospective search). See FullTextUtil and
MemoryIndex.
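To make the prospective search idea concrete, here is a minimal Java sketch using only the standard library. It is not Nux or Lucene code: the class ProspectiveMatcher, its methods, and the "all required terms must occur" matching rule are assumptions made up for illustration. The point is only the shape of the workload: each small transient message is indexed in memory once and then checked against many standing queries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of prospective search; not part of Nux or Lucene.
final class ProspectiveMatcher {
    // Standing queries: subscriber name -> terms that must all occur in a message.
    private final Map<String, Set<String>> subscriptions = new LinkedHashMap<>();

    void subscribe(String name, String... requiredTerms) {
        subscriptions.put(name, new HashSet<>(Arrays.asList(requiredTerms)));
    }

    // Builds a throwaway in-memory term set for one message,
    // then evaluates every standing query against it.
    List<String> match(String message) {
        Set<String> terms = new HashSet<>(
                Arrays.asList(message.toLowerCase(Locale.ROOT).split("\\W+")));
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> s : subscriptions.entrySet()) {
            if (terms.containsAll(s.getValue())) {
                hits.add(s.getKey());
            }
        }
        return hits;
    }
}
```

A historic search engine inverts this picture: one large persistent index, comparatively few queries. Here the index is rebuilt per message and thrown away, so indexing cost matters as much as query cost.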
Example fulltext XQuery that finds all books authored by James that
have something to do with 'salmon fishing manuals', sorted by relevance:
  declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
  declare variable $query := "+salmon~ +fish* manual~";
  (: any arbitrary Lucene query can go here :)
  (: declare variable $query as xs:string external; :)

  for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
  let $score := lucene:match($book/abstract, $query)
  order by $score descending
  return $book
Example fulltext XQuery that matches on extracted sentences:
  declare namespace lucene = "java:nux.xom.pool.FullTextUtil";

  for $book in /books/book
  for $s in lucene:sentences($book/abstract, 0)
  return
    if (lucene:match($s, "+salmon~ +fish* manual~") > 0.0)
    then normalize-space($s)
    else ()
The fulltext search is designed to enable maximum efficiency for
on-the-fly matchmaking combining structured and fuzzy fulltext search in
realtime streaming applications such as XQuery-based XML message
queues, publish-subscribe systems for Blogs/newsfeeds, text chat,
data acquisition and distribution systems, application level routers,
firewalls, classifiers, etc.
Arbitrary Lucene fulltext queries can be run from Java or from XQuery/
XPath/XSLT via a simple extension function. The former approach is
more flexible whereas the latter is more convenient. Lucene analyzers
can split on whitespace, normalize to lower case for case
insensitivity, ignore common terms with little discriminatory value
such as "he", "in", "and" (stop words), reduce the terms to their
natural linguistic root form such as "fishing" being reduced to
"fish" (stemming), resolve synonyms/inflexions/thesauri (upon
indexing and/or querying), etc. Also see Lucene Query Syntax as well
as Query Parser Rules.
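As a rough picture of what such an analysis chain does to text, here is a toy Java sketch using only the standard library. It is not Lucene's Analyzer API: the class ToyAnalyzer, its tiny stop word list, and the naive suffix stripping (a crude stand-in for a real stemmer such as Porter's) are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain for illustration only; real Lucene analyzers are far more thorough.
final class ToyAnalyzer {
    // Tiny illustrative stop word list (assumption; real lists are much longer).
    private static final Set<String> STOP_WORDS =
            Set.of("he", "in", "and", "the", "is", "a", "of");

    // Naive suffix stripping, standing in for a real stemming algorithm.
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s") && !term.endsWith("ss")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Split on non-alphanumerics, lower-case, drop stop words, then stem.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // leading separators yield an empty first token
            }
            String lower = token.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(lower)) {
                continue;
            }
            terms.add(stem(lower));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [fish, river, read, manual]
        System.out.println(analyze("He is fishing in the river and reading manuals"));
    }
}
```

Because the same chain is normally applied to both the indexed text and the query, "fishing manuals" and "fish manual" end up matching the same terms.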
Background: The first prototype was put together over a weekend.
The functionality worked just fine, except that it took ages to index
and search text in a high-frequency environment. Subsequently I wrote
a complete reimplementation of the Lucene interfaces and contributed
that back to Lucene (the bits in org.apache.lucene.index.memory.*).
Next, I placed a smart cache in front of it (the bits in
nux.xom.pool.FullTextUtil / FullTextPool). The net effect is that
fulltext queries over realtime data now run some three orders of
magnitude faster (in the ballpark of 100000-500000 queries/sec) while
preserving the same general functionality. In fact, you'll probably
notice little or no overhead when adding fulltext search to your
streaming apps. See MemoryIndexBenchmark and XQueryBenchmark.
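For readers wondering what "a cache in front of it" can look like in general, here is a minimal Java sketch of a small LRU cache built on java.util.LinkedHashMap. It is not the actual FullTextPool code, whose internals are not described here: the class LruCache, its loader function, and the idea of keying on the query string are assumptions made up for illustration. The general point is that a standing query string only needs to be parsed once, after which repeated evaluations reuse the cached result.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical LRU cache sketch; not the implementation used by Nux.
final class LruCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader;
    private int misses = 0;

    LruCache(int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // evict once over capacity
            }
        };
    }

    // Returns the cached value, computing and storing it on a miss.
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            misses++;
            value = loader.apply(key); // e.g. an expensive query parse
            entries.put(key, value);
        }
        return value;
    }

    synchronized int misses() {
        return misses;
    }

    synchronized int cachedEntries() {
        return entries.size();
    }
}
```

In a setup like this, the loader would be the expensive step (for example parsing a query string into a query object), and the capacity bounds memory use when the set of distinct queries is large.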
Explore and enjoy, perhaps using the queries and sample data from the
samples/fulltext directory as a starting point.
Wolfgang.