Well, I am very grateful for Peter's comments, and intrigued and enthused enough to investigate the architectures he describes for my current project. Part of the reason relates to an additional shortcoming I found with eXist - the insistence on a strictly hierarchical collection structure. This forces me into a choice I do not want to have to make for a movie repository - whether a rom/com goes in the romantic collection or the comedy collection - and of course genre is not the only facet worth modelling collections on. How well the product would support a very large number of collections, with each piece of data able to belong to several, is a concern that I wouldn't have in a graph-based architecture.
I find Peter's assertion:
The analytics should run directly on the data,
not on some extract.
less persuasive. First up, it depends on where you get your data from. If it originated as HTML, then marshalling it into anything but XML usually entails the (almost certainly) premature imposition of some sort of schema. Secondly, this somewhat overlooks the significant data management effort latent in most Big Data projects. At the very least, that amount of data will usually have to go through a significant cleansing process. A very vocal section of the analytics community seems to think this is yet another thing they can do with an R library. Rarely is a dissenting voice heard, yet the no. 1 lament of the very same people is the time and effort spent on dealing with unclean data.
We (software development people) acquire a data management capability (Oracle etc.) and build BI and/or analytics tools on top of that. They choose their analytics capability and then find they have to bolt a data management function on top of it. Well, I think they've got it arse about face. Doing data management with your analytics tool is just as big a sin (if not a bigger one) as doing analytics with your data management tool.
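To make the collection point from the top of this post concrete, here is a minimal sketch of what I mean by a piece of data belonging to several collections at once. It is plain Python, not eXist or any graph product; the class name FacetedCatalogue, its methods and the "facet/value" collection names are all made up for illustration.

```python
from collections import defaultdict


class FacetedCatalogue:
    """Toy many-to-many index: a movie may sit in any number of collections.

    Hypothetical illustration only; not the API of eXist or any graph store.
    """

    def __init__(self):
        self._collections_of = defaultdict(set)  # movie -> collections
        self._movies_in = defaultdict(set)       # collection -> movies

    def add(self, movie, *collections):
        """Register a movie under every collection named."""
        for name in collections:
            self._collections_of[movie].add(name)
            self._movies_in[name].add(movie)

    def movies_in(self, *collections):
        """Return the movies that belong to all of the named collections."""
        sets = [self._movies_in.get(name, set()) for name in collections]
        return set.intersection(*sets) if sets else set()

    def collections_of(self, movie):
        """Return a copy of the set of collections a movie belongs to."""
        return set(self._collections_of.get(movie, set()))


catalogue = FacetedCatalogue()
catalogue.add("When Harry Met Sally", "genre/romance", "genre/comedy", "decade/1980s")
catalogue.add("Airplane!", "genre/comedy", "decade/1980s")
catalogue.add("Casablanca", "genre/romance", "decade/1940s")

# A rom/com is simply whatever sits in both genre collections.
rom_coms = catalogue.movies_in("genre/romance", "genre/comedy")
print(rom_coms)
```

The rom/com never has to pick a home: it is filed under both genres and under its decade, and any further facet is just another collection name. Whether a real product copes with a very large number of such collections is exactly the question I raised above; this sketch says nothing about that.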