Re: [xml-dev] XML aggregation question?

From: peter murray-rust <pm286@xxxxxxxxx>
To: XML Developers List <xml-dev@xxxxxxxxxxxxx>
Date: Sat, 26 Aug 2006 11:16:46 +0100
At 10:36 26/08/2006, Andrew S. Townley wrote:
>Hi Folks,
>
>I'm looking for some collective wisdom from the list on how what I want
>can be done using only XML technologies.  I know at least 3 different
>ways you could do it using databases of various sorts, but I'm trying to
>see if there's a better way, or if the RDBMS is the way to go.
>
>What I'm trying to do is dynamically aggregate information from XML
>instance documents without having to process all of the instances every
>time I want the aggregate.  Maybe this is a job for an XMLDB, but I'm
>not terribly familiar with them.  I'd also like to be able to keep my
>XML instance documents stored on the filesystem rather than having them
>in a database for easy access from a variety of tools, from text editors
>to Web servers to other utilities written in various languages.

This is close to what we are also trying to do for our XML files and 
with some initial success. We create compound documents (with 
potentially multiple namespaces including CML) whose precise 
vocabulary and structure may not always be known. We do not want the 
hassle of creating an RDB schema which will probably not be constant 
and which can be very complex.

I was influenced by Ron Bourret's very useful review of XML databases
- which was at least 5 years old. It was very clear that what you use
is determined by the nature of your documents and your problem. Is
the database "nearly read-only" - e.g. a document archive that you
want to query and add new "static" documents to - or are the entries
themselves frequently updated? Do you want to query on a fixed
vocabulary (e.g. ids, titles, etc.) or do you want the full power of
XQuery? I think he used the terms docucentric and datacentric. He
reviewed ca. 50 possible tools, some of which were commercial. I do
not know whether his analysis is still being updated.

So we have looked at XML databases such as Xindice, eXist and XMLDB
(XML on Berkeley DB). Originally we used Xindice and it managed
250,000 documents of several hundred nodes each. However, each
attribute or element used in the query had to be indexed and the
tables were huge (and mainly sparse). It was clear that there was
considerable tuning required. My current impression is that Xindice
is in hibernation. Any comments?

We then moved to eXist. It is trivial to install and comes with Jon
Bosak's Shakespeare and some preformed queries. I like eXist. I use
this for teaching students in MSc bioinformatics. I simply say "go
and find eXist, download it, and load in some XML genomes".  Within
30 minutes they have a working application that can be queried for a
number of concepts. Recently we put in 1000 documents with ca. 1000
nodes each and it gives sub-second response for simple queries. We
have taken this up to 10,000 documents in the past. About 2 years ago
we tried to put in 250,000 documents and it failed, but we believe it
has been significantly improved since.

Some of my collaborators think XML-DB would be better and they will
be trying it out. Our solutions have to be Open Source as we wish to
package this in a universal Open toolkit for chemistry. In general
our applications are not mission critical and, for instance, an
update might consist of retrieving an entry, deleting the original,
modifying the entry and re-importing it. We also like the idea of
externalizing the system to a hierarchical filestore.
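
As a rough illustration of that retrieve / modify / re-import cycle
against a plain filestore (not any particular XML database), here is
a minimal Python sketch. The function name update_entry and the
callback argument are hypothetical, and it assumes one document per
file. Writing to a temporary file and renaming it over the original
folds "delete the original" and "re-import" into one step.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def update_entry(path, modify):
    """Retrieve one XML entry, apply modify(root), and re-import it in place.

    `modify` is a caller-supplied function that mutates the root element.
    """
    tree = ET.parse(path)          # retrieve the entry
    modify(tree.getroot())         # modify it in memory

    # Write the new version next to the original, then swap it in.
    target_dir = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as out:
            tree.write(out, encoding="utf-8", xml_declaration=True)
        os.replace(tmp_path, path)  # replaces the original with the new file
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

One caveat for compound documents: ElementTree may rewrite namespace
prefixes on output unless they are registered beforehand, so a
prefix-preserving serializer might be preferable for multi-namespace
files.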

I'd be interested in other experiences, especially about scale to 
medium-sized docucentric apps - millions but not billions of nodes.

P.


>Given something like a widget in an inventory or workflow system where
>each instance represents a given widget, e.g.
>
><widget>
>   <status>XXX</status>
>   ...
></widget>
>
>What I would like to be able to do is get a view of the collective
>status of my group of widgets in an on-demand manner.  Other processes
>may be changing the status, so I don't want to introduce a dependency on
>an application-maintained static index updated when the status changes.
>
>As I said, some of the ways I know are possible are:
>
>1 - Move the data from the XML instances into a database and run
>queries.  When I need the data, either re-generate the XML or store the
>XML as a blob.  Obviously, need to do everything in the database or use
>ETL operations to do updates.
>
>2 - Keep the XML on the filesystem and periodically (via cron or
>similar) generate a static index based on the as-is state of the
>information.  Aggregate info is only guaranteed to be as fresh as the
>last batch job.  This also has the problem of not scaling well as the
>number of instances increases.
>
>3 - Provide a centralized persistence layer to essentially do what #1 is
>doing, but as the XML is modified, update the static index.  This seems
>really cumbersome and error prone, plus it means you can't have the
>flexibility of accessing the "raw" instance documents with shell
>scripts, for example.
>
>I'm sure I'm missing something obvious here, so any pointers/suggestions
>would be appreciated.  This has to be a common pattern, so I'm sure
>there are other solutions people have come up with.  My goal is to keep
>whatever solution as light as possible, but if I have to build or use
>infrastructure, then that's what I'll have to do.
>
>Thanks in advance,
>
>ast
>--
>Andrew S. Townley <ast@atownley.org>
>http://atownley.org

Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road,  Cambridge CB2 1EW, UK
+44-1223-763069
Follow-Ups:
- Re: [xml-dev] XML aggregation question?
  - From: Robin Berjon <robin.berjon@expway.fr>
References:
- XML aggregation question?
  - From: "Andrew S. Townley" <ast@atownley.org>