[Updated] Big XML Data: Here is a comprehensive questionnaire, a real-world Big XML Data problem, and suggested solutions

Hi Folks,

Below I updated the questionnaire and the suggested solutions based on Simon and Liam's very insightful posts. I found this particularly enlightening:

	There isn't a single right answer.

	Sometimes it's about extracting just the bits you need to 
	query and putting just those in some kind of database - 
	relational, NoSQL, SlightlySQL, triple store, XML-native, hybrid, 
	whatever, with no hope of reconstructing the original -- if you 
	need the original, just use it directly.

	Sometimes it's worth writing custom code - a database importer, 
	for example - for performance reasons.

	Sometimes some percentage of your queries actually rely on 
	querying markup in mixed content, or on relationships between 
	parts not explicitly stored.

	The direction I see is more hybrid stores (from Virtuoso to MarkLogic) 
	and more variation being acceptable as people come to recognize 
	that different needs are best served with different technologies.

/Roger

---------------------------------------------------------------------------------------------
    Got a Big XML Data problem? Then ask and answer these questions
---------------------------------------------------------------------------------------------
>	How many XML files are to be stored and queried? How big are they?

There are 50 million XML files, each 50 MB in size. That means the queries run over about 2.4 petabytes (2,400 terabytes) of data.
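
(A quick back-of-the-envelope check of that figure, as a Python sketch; whether you land closer to 2.5 or 2.4 PB depends on whether MB/TB are read as decimal or binary units.)

    # Sanity-check the volume: 50 million files x 50 MB each.
    files = 50_000_000
    mb_per_file = 50
    total_mb = files * mb_per_file                # 2,500,000,000 MB
    print(total_mb / 1_000_000, "TB (decimal)")   # 2500.0 TB ~= 2.5 PB
    print(total_mb / 1024 / 1024, "TB (binary)")  # ~2384 TB ~= 2.3 PB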

>	What's the complexity of the XML: is there deep nesting or is it flat?

The files are mostly flat (not deeply nested).

>	Are the XML files volatile or static?

The XML files are relatively static: a few are updated to correct errors, but most stay the same.

>	Are there requirements for further processing or consuming them as XML 
>	elsewhere or are they just a query source?

The XML files are just a query source. The results of the queries on the XML documents are used as input to SAS and SPSS analytics.

>	What type of queries, with what frequency?

We want multiple people to be able to query multiple times a day. Right now the query frequency is low because each query takes days to run.

>	What kind of queries do you need to perform? Full text queries? XPath? XQuery?

The queries use XPath and XQuery.
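
(For concreteness, a minimal sketch of the kind of per-file XPath query involved, here via Python and lxml; the element names are hypothetical, not taken from the actual vocabularies.)

    # Run one XPath query against one file and emit CSV rows,
    # e.g. as input to the SAS/SPSS stage. Names are made up.
    import csv, sys
    from lxml import etree

    tree = etree.parse("sample.xml")                 # one 50 MB file
    writer = csv.writer(sys.stdout)
    for rec in tree.xpath("//record[status='active']"):
        writer.writerow([rec.findtext("id"), rec.findtext("value")])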

>	Do you know or care what the document vocabularies are?

The XML elements, attributes, and overall structure are all well known.

> 	Does every query always query every document?

The queries are across many or all of the 50 million XML documents.

> 	Does every document use the same schema? 

No, there are 40 XML Schemas. 

> 	How widely varied are the schemas?

The XML Schemas are quite similar.

> 	Does the XML have mixed content?

No, the XML is fully structured.

---------------------------------------------------------------------------------------------
                                              The Key Question
---------------------------------------------------------------------------------------------
What is your recommendation for storing and querying this huge amount of XML?

---------------------------------------------------------------------------------------------
                         Suggested solutions (and things to consider)
---------------------------------------------------------------------------------------------

Explore one of the following alternatives:

(A) Start a prototyping project to assess whether MarkLogic is capable of meeting the project requirements.

(B) Choose three native XML databases that look promising and assess how well each of them handles the project requirements.

........

I just discovered (by way of an abort) that there is a 64k limit on the number of distinct element and attribute names you can have in an eXist database, so I am moving over to trial MarkLogic.

Generally I have found that compacting the size and number of nodes in your XML can mitigate the onset of the dreaded heap-space error, but such techniques can only take you so far, especially when querying across collections.
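
(One way to do that kind of compaction outside the database is a streaming pre-pass that keeps only the nodes the queries actually touch; a sketch in Python/lxml, with hypothetical element names.)

    # Streaming pre-pass: copy just the needed fields into a much
    # smaller document, clearing parsed nodes so memory stays bounded.
    from lxml import etree

    out = etree.Element("records")
    for _, elem in etree.iterparse("big.xml", tag="record"):
        slim = etree.SubElement(out, "record", id=elem.get("id", ""))
        slim.text = elem.findtext("status", "")
        elem.clear()                               # free the full node
        while elem.getprevious() is not None:      # and cleared siblings
            del elem.getparent()[0]
    etree.ElementTree(out).write("slim.xml")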

........

I'd likely set up parsing them into a database. Given that they're relatively flat, that might be Cassandra, though its query language might not be sufficient for your needs. Overall, the whole Hadoop-related tool chain might be a better fit for the larger problem. Alternately, something like Titan might be even better, but it would likely require some tooling to get it up and ingesting the data. However, depending on what the SAS and SPSS portions of this are used for, it could potentially yield "answers" directly.
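
(As a concrete sketch of that database-importer idea: stream-parse each file, keep only the queried fields, and bulk-load them. SQLite below is just a stand-in for whatever store is chosen, such as Cassandra, Titan, or a relational DB, and the field names are hypothetical.)

    # Minimal importer: extract two fields per record while streaming,
    # then bulk-insert. Swap sqlite3 for the real target store.
    import sqlite3
    from lxml import etree

    db = sqlite3.connect("extract.db")
    db.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, status TEXT)")
    rows = []
    for _, elem in etree.iterparse("big.xml", tag="record"):
        rows.append((elem.get("id"), elem.findtext("status")))
        elem.clear()
    db.executemany("INSERT INTO records VALUES (?, ?)", rows)
    db.commit()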

........

Interesting data set. Querying 2.4 PB (2,400 TB) of data is going to take a long time regardless of how you do it.

I'm assuming that duplicating that much data is off the table, but if it isn't, you would do best to optimize the storage mechanism to suit the query/retrieval access patterns. MarkLogic is marketed as petabyte-scale, but XML may not be the optimal representation of the data, so it is worth considering the alternatives.

Distributed storage and query are probably the way forward. If you're not already, I'd be looking into Hadoop MapReduce, Presto, etc. Here is a high-level discussion of querying a 300 PB data warehouse:
https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
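
(The map/reduce shape of such a distributed query is simple even if the infrastructure isn't; a toy, single-machine illustration in Python, with a hypothetical XPath count as the per-file "map" step. Real deployments push the map step out to the nodes that hold the data.)

    # Toy map/reduce: workers count matches per file (map); the
    # per-file counts are then summed (reduce).
    from concurrent.futures import ProcessPoolExecutor
    from lxml import etree

    def count_matches(path):                       # the "map" step
        return int(etree.parse(path).xpath(
            "count(//record[status='active'])"))

    if __name__ == "__main__":
        shards = ["part1.xml", "part2.xml", "part3.xml"]  # hypothetical
        with ProcessPoolExecutor() as pool:
            total = sum(pool.map(count_matches, shards))  # the "reduce" step
        print(total)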

........

> MarkLogic is marketed as petabyte-scale, but XML may not be
> the optimal representation of the data, so it is worth
> considering the alternatives.

Luckily, neither MarkLogic nor most other XML-native databases store XML; rather, they store an internal representation that can be mapped to the XPath and XQuery Data Model (the XDM).

A common mistake people make is thinking that XML databases store XML (i.e. store pointy brackets), which is a bit like thinking that relational databases operate on CSV files.

........

I'd go with MarkLogic for that, provided the large documents are segmentable. I've run upwards of 100 million XML documents comfortably on a three-node cluster.

........

Many people import the data from their XML into a relational database; the exception is people whose XML has mixed content.

........

There isn't a single right answer.

Sometimes it's about extracting just the bits you need to query and putting just those in some kind of database - relational, NoSQL, SlightlySQL, triple store, XML-native, hybrid, whatever, with no hope of reconstructing the original -- if you need the original, just use it directly.

Sometimes it's worth writing custom code - a database importer, for example - for performance reasons.

Sometimes some percentage of your queries actually rely on querying markup in mixed content, or on relationships between parts not explicitly stored.

The direction I see is more hybrid stores (from Virtuoso to MarkLogic) and more variation being acceptable as people come to recognize that different needs are best served with different technologies.

