Re: [xml-dev] XML not ideal for Big Data

I'd also raise another point here, and this is an issue that I've had before in the discussion of streaming: There comes a certain stage where it makes far more sense to plan your data strategies around an XML database, use some intelligent indexing to decompose complex documents into simpler ones, and use XQuery and the like to more effectively manage that data in a more cohesive manner.

I keep encountering these stories and shake my head. It's like someone who keeps a 100 MB CSV file and runs a Perl parser on it instead of storing the data in a relational database, then complains that SQL relational databases are just too damned poor at handling data management. In reality, it's usually that they are not willing to make the investment to learn SQL properly, prefer working in Perl, and are not willing to invest the time to build efficient data architectures.
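To make the CSV analogy concrete, here is a minimal sketch in Python, assuming a small made-up CSV and a hypothetical `sales` table (neither comes from the article under discussion). The file is loaded and indexed once, and later questions become SQL queries instead of repeated parsing passes.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the large CSV file.
SAMPLE_CSV = """id,city,amount
1,Seattle,10.5
2,Toronto,7.25
3,Seattle,4.0
"""

def load_csv(conn, csv_text):
    """Load the CSV rows into a table once, so later queries never re-parse the file."""
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, city TEXT, amount REAL)"
    )
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO sales (id, city, amount) VALUES (:id, :city, :amount)",
        rows,
    )
    # An index on the column we filter or group by most often.
    conn.execute("CREATE INDEX idx_sales_city ON sales (city)")
    conn.commit()

def total_by_city(conn):
    """Aggregate in SQL rather than in a hand-written parsing loop."""
    cursor = conn.execute(
        "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
    )
    return dict(cursor)

conn = sqlite3.connect(":memory:")
load_csv(conn, SAMPLE_CSV)
totals = total_by_city(conn)
print(totals)
```

The same trade applies to XML: pay the loading and indexing cost once, then query.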

This is also a reason why I think that XML best practices will ultimately end up promoting XML RESTful services (aka XRX or MODS) and XML databases over method-dependent (and format-dependent) SOA systems. Store your internal data in an XML repository, assign URLs to collections as well as individual entries, let each resource in those collections have both multiple input and output representations and so forth, and bind XQuery operations to each representation.  This means it doesn't matter whether your wire format is JSON or XML or HTML or YAML - so long as you have the relevant representation processors, the internal data abstraction and querying remain the same.
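A rough sketch of the idea in Python, under stated assumptions: the `books` collection, the function names, and the media-type table are all hypothetical, and ElementTree's limited XPath subset stands in for XQuery against a real XML database. The point it illustrates is that one internal query serves every wire format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical internal repository: one XML collection of entries.
REPOSITORY = ET.fromstring(
    "<books>"
    "<book id='1'><title>First</title><year>2008</year></book>"
    "<book id='2'><title>Second</title><year>2009</year></book>"
    "</books>"
)

def query_books(year):
    """The single internal query; ElementTree's XPath subset stands in for XQuery."""
    return REPOSITORY.findall(f"./book[year='{year}']")

def to_xml(books):
    """XML representation processor."""
    root = ET.Element("books")
    root.extend(books)
    return ET.tostring(root, encoding="unicode")

def to_json(books):
    """JSON representation processor."""
    return json.dumps(
        [{"id": b.get("id"), "title": b.findtext("title")} for b in books]
    )

# Output representations keyed by media type (assumed names, not a real API).
PROCESSORS = {"application/xml": to_xml, "application/json": to_json}

def get_books(year, accept):
    """Same query for every caller; only the final serialization differs."""
    return PROCESSORS[accept](query_books(year))

print(get_books("2009", "application/json"))
print(get_books("2009", "application/xml"))
```

Adding HTML or YAML output would mean adding one more entry to `PROCESSORS`; `query_books` would not change.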

Of course, that would require that the author spend some time learning XQuery. What is it about programmers that if the code isn't in their absolute favorite language then they think the technology sucks?

Kurt Cagle
Managing Editor

On Thu, Sep 3, 2009 at 11:59 AM, Liam Quin <liam@w3.org> wrote:
On Thu, Sep 03, 2009 at 11:53:40AM -0400, Simon St.Laurent wrote:
> Perhaps there were better ways to have made XML work with his
> problems... but I think on the whole he's right.
> http://dataspora.com/blog/xml-and-big-data/

Nonsense, XML is perfect! :-)

OK, I'll be serious.

Today, loading a few tens of gigabytes of XML into
Oracle or DB2 or SQL Server isn't likely to be such a
huge bottleneck in performance (and if you find yourself
loading data into a database on a daily basis, you
should ask yourself why you are using the database).

Of course, today one could use MarkLogic, Qizx, DB XML,
or any of a number of other native-XML databases.

There's nothing to say one has to use XML of course.

In its natural habitat, data lives in relational databases or as data
structures in programs. The common import and export formats of these
environments do not resemble XML, so much effort is dedicated to making
XML fit.

I'd argue that there's typically more information _outside_ the
databases, in documents.  Documents are data too.

The article's complaint that the redundancy of XML tags is a
bad thing is misplaced: there's a trade-off between making
the data robust against errors, and easy to debug, vs size.

People write to me every so often and say we should bring
back </>, and I give them an example like,
   <title>Simon Green</author>
   <author>Jennifer Lumpnose</title>

With </> we get
   <title>Simon Green</>
   <author>Jennifer Lumpnose</>

and there's no XML error.  But the correct markup
should have been
   <author>Simon Green</author>
   <title>Jennifer Lumpnose</title>

That is, it was the start tags that the programmer had
transposed by mistake.  Using </> reduces the chance of
catching that error considerably, and there's often no
automatic checking available.
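The check that named end tags buy can be seen with any XML parser. Below is a small sketch using Python's standard `xml.etree.ElementTree`; the `<record>` wrapper element and the `well_formed` helper are assumptions added for illustration. The `</>` form is not XML at all, so it cannot be fed to this parser; the sketch only shows that the transposed version with full end tags is rejected.

```python
import xml.etree.ElementTree as ET

def well_formed(fragment):
    """Return (True, None) if the fragment parses, else (False, error message)."""
    try:
        # Wrap in a hypothetical root so the two sibling elements form one document.
        ET.fromstring(f"<record>{fragment}</record>")
        return True, None
    except ET.ParseError as err:
        return False, str(err)

transposed = "<title>Simon Green</author><author>Jennifer Lumpnose</title>"
intended = "<author>Simon Green</author><title>Jennifer Lumpnose</title>"

ok_transposed, message = well_formed(transposed)
ok_intended, _ = well_formed(intended)

print("transposed start tags:", ok_transposed, message)
print("intended markup:", ok_intended)
```

The parser stops at the first end tag that does not match its start tag, which is exactly the signal that the empty end tag would throw away.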

Of course, extreme crazy tagging is a disease all too
common -- I've done it too -- no argument there.

The argument about "we already know LaTeX and don't
want to learn something else" carries weight in the
present, but if the future is longer than the past,
it's an _awful_ lot longer than the present!

For my part I'd rather receive a terabyte of data in XML
than a gigabyte of undocumented binary data -- if bit 3 is
set then the 9 following bits represent the number of bytes
in the header of the next chunk, unless the current chunk is
the last in a segment, in which case the next header will be
168 bytes in length... no thanks.  The month you spend
writing the software to read it, and the six months you
spend debugging, and the time everyone else working with
the format does the same, will never be paid back in
most cases.

It'll be interesting to see if "Efficient XML Interchange"
makes a difference here, though.

I think the bottom line is that badly-done XML projects
are bad, but a mediocre XML project is often better than
a mediocre or even good project with entirely custom formats.


Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin/
http://www.holoweb.net/~liam/ * http://www.fromoldbooks.org/


XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
