Re: [xml-dev] What is the general direction you are seeing these daysto

Your comments make sense, and I was aiming primarily at Roger's problem, to the extent we really know anything about it. However, I will observe that graph databases give you some very nice capabilities for doing curation and data management (as well as analytics):

- edges can have properties that can be used in queries, for example weighted edges, labels for state, authorizations, etc;

- you can have hypergraphs that give you various degrees of type or classification, metadata, and normalization. These can be built dynamically...

- you can have one-way relationships, so billions of things can be users of Facebook, but a poorly structured query isn't going to traverse these billions of things just because of this relationship (in other words, there are ways of managing supernodes, which is critical for hypergraph models).

Caveat: not all implementations give you all of these...
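
To make the edge-property and one-way-relationship points concrete, here is a minimal in-memory sketch in plain Python. It is not any particular graph database's API; the `PropertyGraph` class, the labels, and the user names are all hypothetical, and a real product would handle supernodes with indexes or partitioning rather than a dictionary.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory directed property graph (illustrative only)."""

    def __init__(self):
        # node -> list of (target, properties) for outgoing edges only
        self.out_edges = defaultdict(list)

    def add_edge(self, source, target, **properties):
        # One-way: nothing is recorded on the target side.
        self.out_edges[source].append((target, properties))

    def neighbors(self, node, predicate=lambda props: True):
        # Follow outgoing edges whose properties satisfy the predicate.
        return [t for t, props in self.out_edges[node] if predicate(props)]


graph = PropertyGraph()
# Many users point at one "supernode"; the supernode stores no back-references.
for i in range(1000):
    graph.add_edge(f"user{i}", "Facebook", label="USER_OF")
graph.add_edge("user0", "user1", label="FRIEND", weight=0.9)
graph.add_edge("user0", "user2", label="FRIEND", weight=0.2)

# Query on edge properties: strongly weighted FRIEND edges out of user0.
close_friends = graph.neighbors(
    "user0", lambda p: p.get("label") == "FRIEND" and p.get("weight", 0) > 0.5
)
# A traversal that starts at the supernode has no outgoing edges to follow.
from_supernode = graph.neighbors("Facebook")
```

Because the USER_OF edges are stored only on the user side, a query that lands on "Facebook" does not fan out across every user unless it explicitly asks for the reverse direction.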

So for me, it's no longer a question of having analytics or data management or even of OLTP and OLAP. I can have all four in one database, and data management and analytics can be iteratively imposed upon a dirty data set using interactive queries and updates, Hadoop-type "batch" processing, and things like Apache Spark, which lets you do both interactive and Hadoop-style processing. Note, there's a subtle implication in the above: you can write analytics data back into the graph as you process it. Interactive updates are run against close neighbors and distant neighbors are hit with batch; the CAP theorem still applies but its impact is mitigated.
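
A rough sketch of the write-back idea, assuming a toy graph held in Python dictionaries: results for nodes within a small radius are written into the graph immediately, and everything further out is deferred to a batch step. The graph, the `interactive_radius` parameter, and the `run_batch` stand-in are hypothetical; in practice the batch side would be something like a Hadoop or Spark job.

```python
from collections import deque

# Hypothetical toy graph: node -> outgoing neighbors.
edges = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}
# Node properties live alongside the graph, so results can be written back.
properties = {node: {} for node in edges}


def hops_from(start):
    """Breadth-first search returning the hop distance to each reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def update_from(start, interactive_radius=1):
    """Write results for close nodes now; defer distant nodes to a batch list."""
    deferred = []
    for node, hops in hops_from(start).items():
        if hops <= interactive_radius:
            properties[node]["hops_from_" + start] = hops  # written back in place
        else:
            deferred.append((node, hops))
    return deferred


def run_batch(start, deferred):
    """Stand-in for a batch job that finishes the deferred writes."""
    for node, hops in deferred:
        properties[node]["hops_from_" + start] = hops


pending = update_from("a")
```

After `update_from("a")`, nodes "a", "b" and "c" already carry the new property, while "d" and "e" wait in `pending` until `run_batch("a", pending)` is called.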

Again, the real answer is still business case specific, but for me the idea of using tools other than these for big data no longer makes sense if the business problem truly demands analysis of multiple terabytes (let alone more)...

Peter Hunsberger

On Wed, Mar 11, 2015 at 7:11 AM, Ihe Onwuka <ihe.onwuka@gmail.com> wrote:

Well, I am very grateful for Peter's comments, and intrigued and enthused enough to investigate the architectures he describes for my current project. Part of the reason relates to an additional shortcoming I found with eXist: the insistence on a strictly hierarchical collection structure. This forces me into a choice I do not want to have to make for a movie repository, namely whether a rom/com goes in the romantic collection or the comedy collection, and of course genre is not the only facet worth modelling collections on. How well the product would support a very large number of collections, with each piece of data being able to belong to several, is a concern that I wouldn't have in a graph-based architecture.
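
As an illustration of the multiple-membership point, here is a minimal sketch in plain Python where a movie is linked to as many collections as apply instead of being filed under exactly one. The titles, the `genre:` and `decade:` facet names, and the `add_membership` helper are all made up for the example; this is not how eXist or any specific graph product stores things.

```python
from collections import defaultdict

# movie -> set of collections it belongs to (hypothetical titles and facets)
belongs_to = defaultdict(set)
# collection -> set of movies, maintained as the reverse index
members = defaultdict(set)


def add_membership(movie, collection):
    """Link a movie to a collection; a movie may have any number of links."""
    belongs_to[movie].add(collection)
    members[collection].add(movie)


add_membership("Example Rom-Com", "genre:romance")
add_membership("Example Rom-Com", "genre:comedy")
add_membership("Example Rom-Com", "decade:1990s")
add_membership("Example Thriller", "genre:thriller")
add_membership("Example Thriller", "decade:1990s")

# The rom/com appears in both genre collections without being duplicated.
in_both = members["genre:romance"] & members["genre:comedy"]
# A second facet (decade) cuts across the genre facet.
nineties = sorted(members["decade:1990s"])
```

Adding another facet is just another kind of link; nothing has to move between collections.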

I find

Peter made a very interesting assertion:

The analytics should run directly on the data, not on some extract.

less persuasive. First up it depends on where you get your data from. If it originated as HTML then marshalling it into anything but XML usually entails the (almost certainly) premature imposition of some sort of schema.

Secondly, this somewhat overlooks the significant data management effort latent in most Big Data projects. At the very least, that amount of data will usually have to go through a significant cleansing process. A very vocal section of the analytics community seems to think this is yet another thing they can do with an R library. Rarely is a dissenting voice heard, yet the no. 1 lament of the very same people is the time and effort spent on dealing with unclean data.

We (software development people) acquire a data management capability (Oracle etc.) and build BI and/or analytics tools on top of that. They choose their analytics capability and then find they have to bolt a data management function on top of it. Well, I think they've got it arse about face. Doing data management with your analytics tool is just as big a sin (if not a bigger one) as doing analytics with your data management tool.