Re: [xml-dev] What is the general direction you are seeing these days to store and query lots of large complex XML?
Well, it's not exactly hard to get data out of Cassandra, Neo4j, Titan, or any relational database you might choose. In fact I'd argue it's easier to build a generic service-oriented endpoint on top of those than on top of 50 million XML files... Both Titan and Neo4j have JSON as native output, and there is, for example, a SPARQL plugin for Neo4j; similarly, Rexster for Titan will give you a SPARQL endpoint (should that be your flavor of the month).
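To make the "JSON as native output" point concrete, here is a minimal sketch of pulling JSON out of Neo4j over its transactional HTTP endpoint. It assumes a local Neo4j 2.x-era server on the default port; the :Document label and title property are made up for illustration:

    # Minimal sketch: Cypher over Neo4j's transactional HTTP endpoint.
    # Assumes a local server on the default port; the label/property
    # names below are hypothetical.
    import json
    import urllib.request

    url = "http://localhost:7474/db/data/transaction/commit"
    payload = {
        "statements": [
            {"statement": "MATCH (n:Document) RETURN n.title LIMIT 10"}
        ]
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)  # Neo4j replies in JSON natively

    for row in result["results"][0]["data"]:
        print(row["row"][0])

Wrapping something like that in a thin service layer is a lot less work than building an equivalent endpoint over 50 million individual XML files.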
However, once more I'll emphasize that this looks very much like a big data problem, and the proper way to manage it is with the tools designed for big data rather than by feeding tool sets designed for other problems. The analytics should run directly on the data, not on some extract. Recently I saw a complaint about Neo4j taking 7 minutes to traverse a billion nodes, and the developers wanted diagnostics to figure out why things were taking so long. These tools are designed for finding answers hidden within very, very large data sets. As a very rough guess (sketched below), assuming some degree of normalization is possible, Roger's entire data set might equate to something like 2 to 3 billion nodes and 3 to 5 billion edges, which would be manageable in a small Titan cluster. Titan has been used with graphs of 100 billion edges...
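To show where a number like that could come from, here is the back-of-the-envelope arithmetic; every input is an assumption on my part, not a measurement:

    # Back-of-the-envelope sizing for the 2-3 billion node guess.
    # All inputs below are assumptions, not measurements.
    documents = 50_000_000        # the ~50 million XML files mentioned above
    elements_per_doc = 50         # assumed average element count after normalization
    edges_per_node = 1.5          # assumed: parent links plus some cross-references

    nodes = documents * elements_per_doc
    edges = int(nodes * edges_per_node)
    print(f"nodes ~ {nodes / 1e9:.1f} billion, edges ~ {edges / 1e9:.1f} billion")
    # prints: nodes ~ 2.5 billion, edges ~ 3.8 billion -- inside the guessed range

Vary the per-document and per-node assumptions and you stay within roughly the 2 to 3 billion node, 3 to 5 billion edge range, which is small by the standards these systems are built for.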