Re: [xml-dev] What is the general direction you are seeing these days to store and query lots of large complex XML?
Well, it's not exactly hard to get data out of Cassandra, Neo4j, Titan, or any relational database you might choose. In fact I'd argue it's easier to build a generic service-oriented endpoint on top of those than on top of 50 million XML files... Both Titan and Neo4j have JSON as native output, and there is, for example, a SPARQL plugin for Neo4j; similarly, Rexster for Titan will give you a SPARQL endpoint (should that be your flavor of the month).
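To make the "JSON as native output" point concrete, here is a minimal sketch of pulling JSON out of Neo4j over its transactional HTTP endpoint. It assumes a local Neo4j 2.x-era server on the default port; the :Document label and title property are made up for illustration:

    # Minimal sketch: Cypher over Neo4j's transactional HTTP endpoint.
    # Assumes a local server on the default port; the label/property
    # names below are hypothetical.
    import json
    import urllib.request

    url = "http://localhost:7474/db/data/transaction/commit"
    payload = {
        "statements": [
            {"statement": "MATCH (n:Document) RETURN n.title LIMIT 10"}
        ]
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)  # Neo4j replies in JSON natively

    for row in result["results"][0]["data"]:
        print(row["row"][0])

Wrapping something like that in a thin service layer is a lot less work than building an equivalent endpoint over 50 million individual XML files.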
However, once more I'll emphasize that this looks very much like a big data problem, and the proper way to manage it is with the tools designed for big data rather than by feeding tool sets designed for other problems. The analytics should run directly on the data, not on some extract. Recently I saw a complaint about Neo4j taking 7 minutes to traverse a billion nodes, and the developers wanted diagnostics to figure out why things were taking so long. These tools are designed for finding answers hidden within very, very large data sets. As a very rough guess (sketched below), assuming some degree of normalization is possible, Roger's entire data set might equate to something like 2 to 3 billion nodes and 3 to 5 billion edges, which would be manageable in a small Titan cluster. Titan has been used with graphs of 100 billion edges...
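To show where a number like that could come from, here is the back-of-the-envelope arithmetic; every input is an assumption on my part, not a measurement:

    # Back-of-the-envelope sizing for the 2-3 billion node guess.
    # All inputs below are assumptions, not measurements.
    documents = 50_000_000        # the ~50 million XML files mentioned above
    elements_per_doc = 50         # assumed average element count after normalization
    edges_per_node = 1.5          # assumed: parent links plus some cross-references

    nodes = documents * elements_per_doc
    edges = int(nodes * edges_per_node)
    print(f"nodes ~ {nodes / 1e9:.1f} billion, edges ~ {edges / 1e9:.1f} billion")
    # prints: nodes ~ 2.5 billion, edges ~ 3.8 billion -- inside the guessed range

Vary the per-document and per-node assumptions and you stay within roughly the 2 to 3 billion node, 3 to 5 billion edge range, which is small by the standards these systems are built for.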