Re: [xml-dev] XML aggregation question?
- From: peter murray-rust <pm286@xxxxxxxxx>
- To: XML Developers List <xml-dev@xxxxxxxxxxxxx>
- Date: Sat, 26 Aug 2006 11:16:46 +0100
At 10:36 26/08/2006, Andrew S. Townley wrote:
>Hi Folks,
>
>I'm looking for some collective wisdom from the list on how what I want
>can be done using only XML technologies. I know at least 3 different
>ways you could do it using databases of various sorts, but I'm trying to
>see if there's a better way, or if the RDBMS is the way to go.
>
>What I'm trying to do is dynamically aggregate information from XML
>instance documents without having to process all of the instances every
>time I want the aggregate. Maybe this is a job for an XMLDB, but I'm
>not terribly familiar with them. I'd also like to be able to keep my
>XML instance documents stored on the filesystem rather than having them
>in a database for easy access from a variety of tools, from text editors
>to Web servers to other utilities written in various languages.
This is close to what we are also trying to do for our XML files and
with some initial success. We create compound documents (with
potentially multiple namespaces including CML) whose precise
vocabulary and structure may not always be known. We do not want the
hassle of creating an RDB schema which will probably not be constant
and which can be very complex.
I was influenced by Ron Bourret's very useful review of XML databases
- which was at least 5 years old. It was very clear that what you use
is determined by the nature of your documents and your problem. Is
the database "nearly read-only" - e.g. a document archive that you
want to query and add new "static" documents to - or are the entries
themselves frequently updated? Do you want to query on a fixed set of
vocabulary (e.g. ids, titles, etc.) or do you want the full power of
XQuery? I think he used the terms docucentric and datacentric. He
reviewed ca 50 possible tools, some of which were commercial. I do
not know whether his analysis is still updated.
So we have looked at XML databases such as Xindice, eXist and XMLDB
(XML on Berkeley DB). Originally we used Xindice and it managed
250,000 documents of several hundred nodes each. However each attribute
or element used in the query had to be indexed and the tables were
huge (and mainly sparse). It was clear that there was considerable
tuning required. My current impression is that Xindice is in
hibernation. Any comments?
We then moved to eXist. It is trivial to install and comes with Jon
Bosak's Shakespeare and some preformed queries. I like eXist. I use
this for teaching students in MSc bioinformatics. I simply say "go
and find eXist, download it, and load in some XML genomes". Within
30 minutes they have a working application that can be queried for a
number of concepts. Recently we put in 1000 documents with ca 1000
nodes each and it gives sub-second response for simple queries. We
have taken this up to 10,000 documents in the past. About 2 years ago
we tried to put in 250,000 documents and it failed, but we believe it
has been significantly improved.
Some of my collaborators think XML-DB would be better and they will
be trying it out. Our solutions have to be Open Source as we wish to
package this in a universal Open toolkit for chemistry. In general
our applications are not mission critical and, for instance, an
update might consist of retrieving an entry, deleting the original,
modifying the entry and re-importing it. We also like the idea of
externalizing the system to a hierarchical filestore.
I'd be interested in other experiences, especially about scale to
medium-sized docucentric apps - millions but not billions of nodes.
P.
>Given something like a widget in an inventory or workflow system where
>each instance represents a given widget, e.g.
>
><widget>
> <status>XXX</status>
> ...
></widget>
>
>What I would like to be able to do is get a view of the collective
>status of my group of widgets in an on-demand manner. Other processes
>may be changing the status, so I don't want to introduce a dependency on
>an application-maintained static index updated when the status changes.
>
>As I said, some of the ways I know are possible are:
>
>1 - Move the data from the XML instances into a database and run
>queries. When I need the data, either re-generate the XML or store the
>XML as a blob. Obviously, need to do everything in the database or use
>ETL operations to do updates.
>
>2 - Keep the XML on the filesystem and periodically (via cron or
>similar) generate a static index based on the as-is state of the
>information. Aggregate info is only guaranteed to be as fresh as the
>last batch job. This also has the problem of not scaling well as the
>number of instances increases.
>
>3 - Provide a centralized persistence layer to essentially do what #1 is
>doing, but as the XML is modified, update the static index. This seems
>really cumbersome and error prone, plus it means you can't have the
>flexibility of accessing the "raw" instance documents with shell
>scripts, for example.
>
>I'm sure I'm missing something obvious here, so any pointers/suggestions
>would be appreciated. This has to be a common pattern, so I'm sure
>there are other solutions people have come up with. My goal is to keep
>whatever solution as light as possible, but if I have to build or use
>infrastructure, then that's what I'll have to do.
>
>Thanks in advance,
>
>ast
>--
>Andrew S. Townley <ast@atownley.org>
>http://atownley.org
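
As a rough illustration of the on-demand aggregate asked about in the quoted
message, here is a minimal sketch in Python. It assumes the widgets live as
one .xml file each in a single directory, with <status> as a direct child of
the root element. The class name StatusAggregator and the in-memory cache are
hypothetical, not taken from any tool mentioned in this thread. It still has
to stat every file on each call, but it only re-parses files whose
modification time or size has changed, so other processes can keep editing
the raw files.

```python
import os
import xml.etree.ElementTree as ET
from collections import Counter


class StatusAggregator:
    """Count <status> values across widget files, re-parsing only changed files."""

    def __init__(self, directory):
        self.directory = directory
        # path -> ((mtime_ns, size), status text)
        self._cache = {}

    def counts(self):
        seen = set()
        for entry in os.scandir(self.directory):
            if not entry.is_file() or not entry.name.endswith(".xml"):
                continue
            seen.add(entry.path)
            st = entry.stat()
            key = (st.st_mtime_ns, st.st_size)
            cached = self._cache.get(entry.path)
            if cached is None or cached[0] != key:
                # File is new or has changed since the last call: parse it again.
                status = ET.parse(entry.path).getroot().findtext("status")
                self._cache[entry.path] = (key, status)
        # Forget files that have been deleted since the last call.
        for path in list(self._cache):
            if path not in seen:
                del self._cache[path]
        return Counter(status for _, status in self._cache.values())
```

Calling StatusAggregator("widgets").counts() would return a Counter mapping
each status value to the number of widgets carrying it. Note the limits of
the assumption: a change that keeps both the file size and the timestamp
(within the filesystem's resolution) would be missed, and the cache lives
only as long as the process does.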
Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road, Cambridge CB2 1EW, UK
+44-1223-763069