Re: [xml-dev] What is the general direction you are seeing these days to

Steve hits on an important part of my reasoning. For example, you can take something like Hadoop and run variations of an analysis iteratively. Say you're doing a (now classic) friends-of-friends analysis, which is known to have polynomial complexity as you increase the relationship depth. For a given set of users, that depth can vary considerably depending on how far any given person is from a "super node" or on other data patterns. Set an upper bound on execution time and start running the analysis, continually increasing the depth until you hit that bound. You're going to pull out far more interesting data: things like a 40% chance of knowing somebody who knows somebody who knows Kevin Bacon, and a 70% chance of knowing someone at 4 steps, etc. If you're dealing in statistical analysis, the algorithms are already coded up for many common analyses and it's just a case of configuring them for a given use case. Yes, you are talking about entirely new sets of infrastructure and skills for many organizations, but the gain is the ability to perform many orders of magnitude more analysis tasks, perform them many orders of magnitude faster, and perform them over many orders of magnitude more data.
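To make the time-bounded idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code, and everything in it is an assumption for illustration: the `reach_by_depth` name, the dict-of-sets graph, and the toy data. It widens the search one relationship step at a time from a chosen person and stops when the time budget is used up, reporting what fraction of the other users is reachable at each depth.

```python
import time


def reach_by_depth(graph, target, time_budget_s=1.0):
    """Return {depth: fraction of other users within `depth` steps of target}.

    `graph` is a hypothetical in-memory adjacency map (user -> set of friends)
    that lists every user as a key. The budget is checked between depths, so
    a depth that has started is always finished before stopping.
    """
    deadline = time.monotonic() + time_budget_s
    others = max(len(graph) - 1, 1)  # everyone except the target
    seen = {target}
    frontier = [target]
    fractions = {}
    depth = 0
    while frontier and time.monotonic() < deadline:
        depth += 1
        next_frontier = []
        for user in frontier:
            for friend in graph.get(user, ()):
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
        if frontier:  # record a depth only if it reached someone new
            fractions[depth] = (len(seen) - 1) / others
    return fractions


# Toy data, invented for the example.
graph = {
    "kevin": {"ann", "bob"},
    "ann": {"kevin", "cat"},
    "bob": {"kevin"},
    "cat": {"ann", "dan"},
    "dan": {"cat"},
    "eve": set(),
}
print(reach_by_depth(graph, "kevin"))
```

On the toy graph this reports 40% of users at one step, 60% at two and 80% at three. On real data the budget, not the graph, would usually decide where it stops, which is the point of setting the bound first.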

Having said all that, I do have to qualify it: I don't know the business domain, I don't know your organization and I don't know your organization's technical capabilities. I'm making these recommendations based purely on two things: you have a huge volume of data and you tell us you want to feed something into SAS and SPSS. I'm assuming that this is part of a larger set of analyses that is ongoing and that it is worth some considerable investment to build a tool set to get the benefits I describe above...

Peter Hunsberger

On Wed, Mar 11, 2015 at 4:14 AM, Costello, Roger L. <costello@mitre.org> wrote:

Hi Folks,

Peter made a very interesting assertion:

The analytics should run directly on the data, not on some extract.

My plan was to perform XPath and XQuery on the 50 million XML documents and then use the query results as input into SAS and SPSS analytics. So my approach is quite different than what Peter advocates.
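For readers who want to picture the extract-then-analyze approach, here is a rough Python sketch. It is not Roger's actual pipeline: the element names, the `extract_rows` and `write_csv` helpers, and the sample documents are all made up. It uses the standard library's ElementTree, which supports only a small subset of XPath and no XQuery, to pull fields from each document and write them as CSV, a format both SAS and SPSS can import.

```python
import csv
import io
import xml.etree.ElementTree as ET


def extract_rows(xml_texts, record_path, fields):
    """Yield one list of field values per record matched by `record_path`.

    `record_path` and `fields` are ElementTree-style paths; a missing
    field becomes an empty string.
    """
    for text in xml_texts:
        root = ET.fromstring(text)
        for record in root.findall(record_path):
            yield [record.findtext(field, default="") for field in fields]


def write_csv(rows, header, out):
    """Write a header line plus the extracted rows to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)


# Hypothetical sample documents standing in for the real collection.
docs = [
    "<orders><order><id>1</id><total>9.50</total></order></orders>",
    "<orders><order><id>2</id></order></orders>",
]
fields = ["id", "total"]
buffer = io.StringIO()
write_csv(extract_rows(docs, "./order", fields), fields, buffer)
print(buffer.getvalue())
```

The extract is a fixed snapshot: any question that needs a field or relationship not in `fields` means going back and running the extraction again.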

Peter, why do you assert that the analytics should be run directly on the data? Why is that superior to querying the data and using the query results as input to the analytics? Does everyone agree with Peter that the analytics should be run directly on the data? Anyone disagree with Peter?

/Roger