Big XML Data: Here is a comprehensive questionnaire, a real world Big XML Data problem, and suggested solutions
From: "Costello, Roger L." <costello@mitre.org>
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Date: Fri, 6 Mar 2015 12:24:25 +0000
Hi Folks,

A truly outstanding and valuable discussion! Thank you so much.

I want to summarize. One valuable outcome of this discussion is that we have come up with a comprehensive set of questions to ask when deciding what solution to use for a Big XML Data problem. Below is the list of questions, with answers for my real world Big XML Data problem. The solutions that have been suggested follow the questions.

---------------------------------------------------------------------------------------------
    Got a Big XML Data problem? Then ask and answer these questions
---------------------------------------------------------------------------------------------
>	How many XML files are to be stored and queried? How big are they?

There are 50 million XML files, each 50 MB in size. That means the queries run over roughly 2.4 petabytes (about 2,400 terabytes, counting 1 TB as 1024 x 1024 MB) of data.
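
The total depends a little on how the units are counted. A quick back-of-the-envelope check, using only the file count and file size given above (the variable names are just for illustration):

```python
# Back-of-the-envelope size estimate for the collection described above.
FILE_COUNT = 50_000_000   # 50 million XML files (from the post)
FILE_SIZE_MB = 50         # 50 MB per file (from the post)

total_mb = FILE_COUNT * FILE_SIZE_MB

# Decimal (SI) units: 1 TB = 10**6 MB, 1 PB = 10**9 MB.
total_tb_decimal = total_mb / 10**6
total_pb_decimal = total_mb / 10**9

# Binary-style units: 1 TB = 1024 * 1024 MB, 1 PB = 1024 TB.
total_tb_binary = total_mb / 1024**2
total_pb_binary = total_tb_binary / 1024

print(f"decimal: {total_tb_decimal:,.0f} TB = {total_pb_decimal:.2f} PB")
print(f"binary:  {total_tb_binary:,.0f} TB = {total_pb_binary:.2f} PB")
```

Either way the collection lands between roughly 2.3 and 2.5 petabytes, so the order of magnitude is the same for planning purposes.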

>	What's the complexity of the XML: is there deep nesting or is it flat?

The files are mostly flat (not deeply nested).

>	Are the XML files volatile or static?

The XML files are relatively static - a few are updated for errors but most stay the same.

>	Are there requirements for further processing or consuming them as XML 
>	elsewhere or are they just a query source?

The XML files are just a query source. The results of the queries on the XML documents are used as input to SAS and SPSS analytics.

>	What type of queries, with what frequency?

We want multiple people to query multiple times a day. Right now the query frequency is low because the queries take days to run.

>	What kind of queries do you need to perform? Full text queries? XPath? XQuery?

The queries use XPath and XQuery.

>	Do you know or care what the document vocabularies are?

The XML elements and attributes are very well known. The structure of the XML is well known.

>	Does every query always query every document?

The queries are across many or all of the 50 million XML documents.

>	Does every document use the same schema?

No, there are 40 XML Schemas. 

>	How widely varied are the schemas?

The XML Schemas are quite similar.

---------------------------------------------------------------------------------------------
                                              The Key Question
---------------------------------------------------------------------------------------------
What is your recommendation for storing and querying this huge amount of XML?

---------------------------------------------------------------------------------------------
                         Suggested solutions (and things to consider)
---------------------------------------------------------------------------------------------

Explore one of the following alternatives:

(A) Start a prototyping project to assess whether MarkLogic is capable of meeting the project requirements.

(B) Choose three native XML databases that look promising and assess each of the three to compare how well they handle the project requirements.

........

I just discovered (by way of an abort) that there is a 64k limit on the number of distinct element and attribute names you can have in an eXist database, so I am moving over to trial MarkLogic.

Generally, I have found that reducing the size and number of nodes in your XML can delay the onset of the dreaded heap space error, but such techniques can only take you so far, especially when querying across collections.

........

I'd likely set up a process to parse them into a database. Given that the files are relatively flat, that might be Cassandra, though its query language might not be sufficient for your needs. Overall, the whole Hadoop-related tool chain might be a better fit for the larger problem. Alternatively, something like Titan might be even better, but it would likely require some tooling to get it up and running and ingesting the data. However, depending on what the SAS and SPSS portions of this are used for, it could potentially yield "answers" directly.
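
To make the "parse them into a database" idea concrete, here is a minimal sketch of flattening mostly-flat XML into rows that a column store or SAS/SPSS could load. The element names (`records`, `record`, `name`, `value`) and the helper functions are hypothetical, since the real vocabularies are not given in this thread, and a real pipeline would write to the chosen store rather than to an in-memory CSV.

```python
import csv
import io
import xml.etree.ElementTree as ET

def flatten_records(xml_source, record_tag):
    """Yield one dict per <record_tag> element, streaming to keep memory low."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            row = dict(elem.attrib)              # attributes become columns
            for child in elem:                   # flat children become columns
                row[child.tag] = (child.text or "").strip()
            yield row
            elem.clear()                         # drop the record's children and text

def rows_to_csv(rows, fieldnames):
    """Serialize rows as CSV text; columns not listed in fieldnames are ignored."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return out.getvalue()

# Hypothetical sample document standing in for one of the 50 MB files.
SAMPLE = b"""<records>
  <record id="1"><name>alpha</name><value>10</value></record>
  <record id="2"><name>beta</name><value>20</value></record>
</records>"""

rows = list(flatten_records(io.BytesIO(SAMPLE), "record"))
csv_text = rows_to_csv(rows, ["id", "name", "value"])
print(csv_text)
```

Streaming with `iterparse` matters here because each file is around 50 MB; building a full tree per file would work, but it wastes memory when only flat records are needed.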

........

Interesting data set. Querying 2.4PB (2400TB) of data is going to take a long time regardless of how you do it.

I'm assuming that duplicating that much data is off the table, but if not, you would do best to optimize the storage mechanism to suit the query/retrieval access patterns. MarkLogic is marketed as petabyte-scale, but XML may not be the most optimal representation of the data, it is worth considering the alternatives.

Distributed storage and query are probably the way forward. If you are not already, I'd be looking into Hadoop MapReduce, Presto, etc. Here is a high-level discussion of querying a 300PB data warehouse:
https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
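
For readers unfamiliar with the map/reduce pattern mentioned above, here is a single-process sketch of its shape. It is not Hadoop or Presto code: the documents, the `item` element, and its `type` attribute are made up for illustration, and Python's built-in `map` stands in for the distributed map phase.

```python
from collections import Counter
from functools import reduce
import xml.etree.ElementTree as ET

def map_document(xml_text):
    """Map step: one document in, one small partial result out."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset; './/item' finds all <item> descendants.
    return Counter(item.get("type", "unknown") for item in root.findall(".//item"))

def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right

# Hypothetical stand-ins for the real documents.
DOCUMENTS = [
    '<doc><item type="a"/><item type="b"/></doc>',
    '<doc><item type="a"/></doc>',
    '<doc><item/></doc>',
]

partials = map(map_document, DOCUMENTS)   # would run in parallel across nodes
totals = reduce(reduce_counts, partials, Counter())
print(dict(totals))
```

The point of the pattern is that each map call only ever sees one document and returns a small result, so the work can be spread across as many machines as there are files to read.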

........

> MarkLogic is marketed as petabyte-scale, but XML may not be 
> the most optimal representation of the data, it is worth 
> considering the alternatives.

Luckily, neither MarkLogic nor most other XML-native databases store XML; rather, they store an internal representation that can be mapped to the XQuery and XPath Data Model (the XDM).

A common mistake people make is thinking that XML databases store XML (i.e. store pointy brackets), which is a bit like thinking that relational databases operate on CSV files.
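
As a toy illustration of that distinction, the sketch below shreds a document into a table of typed nodes instead of keeping the markup. This is not how MarkLogic or any particular product stores data; the `Node` layout and the `shred` function are assumptions made up for this example, and it ignores namespaces, tail text, comments, and processing instructions.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional

class Node(NamedTuple):
    node_id: int
    parent_id: Optional[int]
    kind: str    # "element", "attribute" or "text"
    name: str    # empty for text nodes
    value: str   # empty for element nodes

def shred(xml_text):
    """Turn markup into a flat list of nodes linked by parent ids."""
    nodes = []

    def visit(elem, parent_id):
        elem_id = len(nodes)
        nodes.append(Node(elem_id, parent_id, "element", elem.tag, ""))
        for attr_name, attr_value in elem.attrib.items():
            nodes.append(Node(len(nodes), elem_id, "attribute", attr_name, attr_value))
        if elem.text and elem.text.strip():
            nodes.append(Node(len(nodes), elem_id, "text", "", elem.text.strip()))
        for child in elem:
            visit(child, elem_id)

    visit(ET.fromstring(xml_text), None)
    return nodes

nodes = shred('<book id="42"><title>XDM</title></book>')
for node in nodes:
    print(node)
```

Once the data is in a form like this, a query engine can index node names and values and never needs to re-read angle brackets at query time.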

........

I'd go with MarkLogic for that, if the large documents are segmentable. I've run upwards of 100 million XML document files comfortably on a three-node cluster.

........