xml-dev - [ANN] nux-1.2 release


To: XML Developers List <xml-dev@lists.xml.org>
Subject: [ANN] nux-1.2 release
From: Wolfgang Hoschek <whoschek@lbl.gov>
Date: Wed, 25 May 2005 16:54:07 -0700

The nux-1.2 release has been uploaded to

     http://dsd.lbl.gov/nux/

Nux is an open-source Java XML toolset geared towards embedded use in
high-throughput XML messaging middleware such as large-scale
Peer-to-Peer infrastructures, message queues, publish-subscribe and
matchmaking systems for Blogs/newsfeeds, text chat, data acquisition
and distribution systems, application level routers, firewalls,
classifiers, etc. It is not an XML database, and does not attempt to
be one.


Changelog:

XQuery/XPath: Added optional fulltext search via the Apache Lucene
engine. Similar to Google search, it is easy to use, powerful,
efficient and goes far beyond what can be done with standard XPath
regular expressions and string manipulation functions. It is similar
in intent but not directly related to the preliminary W3C fulltext
search drafts. Rather than targeting fulltext search of infrequent
queries over huge persistent data archives (historic search), Nux
targets fulltext search of huge numbers of queries over comparatively
small transient realtime data (prospective search). See FullTextUtil
and MemoryIndex.
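
To make the historic/prospective distinction concrete, here is a
minimal, self-contained Java sketch of prospective search: a set of
standing queries is evaluated against each small transient message as
it arrives. It uses neither Nux nor Lucene. The class
ProspectiveMatcher, its methods, and the "all terms must occur" rule
are hypothetical simplifications chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Nux or Lucene.
final class ProspectiveMatcher {
    // Standing queries: subscriber name -> terms that must all occur in a message.
    private final Map<String, Set<String>> queries = new LinkedHashMap<>();

    void subscribe(String name, String... requiredTerms) {
        Set<String> terms = new HashSet<>();
        for (String t : requiredTerms) {
            terms.add(t.toLowerCase(Locale.ROOT));
        }
        queries.put(name, terms);
    }

    // Build a throwaway in-memory term set for one message,
    // then run every standing query against it.
    List<String> match(String message) {
        Set<String> index = new HashSet<>(
            Arrays.asList(message.toLowerCase(Locale.ROOT).split("\\W+")));
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> q : queries.entrySet()) {
            if (index.containsAll(q.getValue())) {
                hits.add(q.getKey());
            }
        }
        return hits;
    }
}
```

The per-message term set is discarded after matching, which is the
point: the queries are long-lived and numerous, the data is small and
short-lived.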

Example fulltext XQuery that finds all books authored by James that  
have something to do with 'salmon fishing manuals', sorted by relevance:

declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~";
(: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)
for $book in /books/book[author="James"
                         and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book


Example fulltext XQuery that matches on extracted sentences:

declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
for $book in /books/book
     for $s in lucene:sentences($book/abstract, 0)
         return
             if (lucene:match($s, "+salmon~ +fish* manual~") > 0.0)
             then normalize-space($s)
             else ()

The fulltext search facility is designed to enable maximum efficiency
for on-the-fly matchmaking combining structured and fuzzy fulltext
search in realtime streaming applications such as XQuery-based XML
message queues, publish-subscribe systems for Blogs/newsfeeds, text
chat, data acquisition and distribution systems, application level
routers, firewalls, classifiers, etc.

Arbitrary Lucene fulltext queries can be run from Java or from
XQuery/XPath/XSLT via a simple extension function. The former approach
is more flexible whereas the latter is more convenient. Lucene
analyzers can split on whitespace, normalize to lower case for case
insensitivity, ignore common terms with little discriminatory value
such as "he", "in", "and" (stop words), reduce the terms to their
natural linguistic root form such as "fishing" being reduced to
"fish" (stemming), resolve synonyms/inflexions/thesauri (upon
indexing and/or querying), etc. Also see Lucene Query Syntax as well
as Query Parser Rules.
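
The analysis steps described above (whitespace splitting, lower
casing, stop word removal, stemming) can be sketched in plain Java as
follows. This is a toy stand-in, not a Lucene analyzer: the class
ToyAnalyzer, its stop word list, and the crude "-ing" suffix stripping
are assumptions made for illustration, and real stemmers are
considerably more careful.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of analysis steps; not a Lucene API.
final class ToyAnalyzer {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("he", "in", "and"));

    // Crude suffix stripping stands in for a real stemmer.
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        return term;
    }

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {   // split on whitespace
            String term = token.toLowerCase(Locale.ROOT);  // case insensitivity
            if (term.isEmpty() || STOP_WORDS.contains(term)) {
                continue;                                  // drop stop words
            }
            terms.add(stem(term));                         // reduce to a root form
        }
        return terms;
    }
}
```

Applying the same analysis to both the indexed text and the query
terms is what lets "Fishing" in a document match "fish" in a query.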

Background: The first prototype was put together over the weekend.
The functionality worked just fine, except that it took ages to index
and search text in a high-frequency environment. Subsequently I wrote
a complete reimplementation of the Lucene interfaces and contributed
that back to Lucene (the bits in org.apache.lucene.index.memory.*).
Next, I placed a smart cache in front of it (the bits in
nux.xom.pool.FullTextUtil / FullTextPool). The net effect is that
fulltext queries over realtime data now run some three orders of
magnitude faster (in the ballpark of 100000-500000 queries/sec) while
preserving the same general functionality. In fact, you'll probably
notice little or no overhead when adding fulltext search to your
streaming apps. See MemoryIndexBenchmark and XQueryBenchmark.
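
The general idea of putting a cache in front of an expensive step,
such as parsing a query string that recurs many times per second, can
be sketched as a small LRU cache. This is only an illustration of the
technique and makes no claim about how FullTextPool is actually
implemented; the class LruCache and its loader function are
hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; not the actual FullTextPool implementation.
final class LruCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader;
    private int misses = 0;

    LruCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // Return the cached value, computing and storing it on a miss.
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            misses++;
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    synchronized int misses() {
        return misses;
    }
}
```

In a streaming setting the loader would be the costly operation (for
example, turning a query string into a parsed query object), so
repeated queries pay that cost only once while they stay in the cache.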

Explore and enjoy, perhaps using the queries and sample data from the  
samples/fulltext directory as a starting point.

Wolfgang.
