Re: [xml-dev] XML not ideal for Big Data


From: Liam Quin <liam@w3.org>
To: "Simon St.Laurent" <simonstl@simonstl.com>
Date: Thu, 3 Sep 2009 14:59:08 -0400

On Thu, Sep 03, 2009 at 11:53:40AM -0400, Simon St.Laurent wrote:
> Perhaps there were better ways to have made XML work with his 
> problems... but I think on the whole he's right.
> 
> http://dataspora.com/blog/xml-and-big-data/

Nonsense, XML is perfect! :-)

OK, I'll be serious.

Today, loading a few tens of gigabytes of XML into 
Oracle or DB2 or SQL Server isn't likely to be such a
huge bottleneck in performance (and if you find yourself
loading data into a database on a daily basis, you
should ask yourself why you are using the database).

Of course, today one could use MarkLogic, Qizx, DB XML,
or any of a number of other native-XML databases.

There's nothing to say one has to use XML of course.

[[
In its natural habitat, data lives in relational databases or as data
structures in programs. The common import and export formats of these
environments do not resemble XML, so much effort is dedicated to making
XML fit.
]]

I'd argue that there's typically more information _outside_ the
databases, in documents.  Documents are data too.

The article's complaint that the redundancy of XML tags is a
bad thing is misplaced: there's a trade-off between making
the data robust against errors, and easy to debug, vs size.

People write to me every so often and say we should bring
back </> and I give them an example like,
    <title>Simon Green</author>
    <author>Jennifer Lumpnose</title>

With </> we get
    <title>Simon Green</>
    <author>Jennifer Lumpnose</>

and there's no XML error.  But the correct markup
should have been
    <author>Simon Green</author>
    <title>Jennifer Lumpnose</title>

That is, it was the start tags that the programmer had
transposed by mistake.  Using </> reduces the chance of
catching that error considerably, and there's often no
automatic checking available.
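
A rough sketch of the same point in Python, using the standard
xml.etree.ElementTree parser.  The <book> wrapper and the helper name
is_well_formed are my own assumptions, added only so that each sample
has a single root element.  XML has no </>, so the second sample
imitates it by giving every end tag the name of its start tag.

```python
import xml.etree.ElementTree as ET

# Transposed start tags with named end tags: the names do not match,
# so this is not well-formed.  The <book> wrapper is an assumption.
transposed = (
    "<book>"
    "<title>Simon Green</author>"
    "<author>Jennifer Lumpnose</title>"
    "</book>"
)

# The same mistake as it would look if end tags carried no name.
# Simulated here by closing each element with its own start-tag name.
as_if_empty_end_tags = (
    "<book>"
    "<title>Simon Green</title>"
    "<author>Jennifer Lumpnose</author>"
    "</book>"
)


def is_well_formed(text):
    """Return True if the parser accepts text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(transposed))            # parser rejects the mismatch
print(is_well_formed(as_if_empty_end_tags))  # parser accepts; error is hidden
```

The first sample is rejected at parse time; the second goes through
with the author and title still swapped, and nothing complains.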

Of course, extreme crazy tagging is a disease all too
common -- I've done it too -- no argument there.

The argument about "we already know LaTeX and don't
want to learn something else" carries weight in the
present, but if the future is longer than the past,
it's an _awful_ lot longer than the present!

For my part I'd rather receive a terabyte of data in XML
than a gigabyte of undocumented binary data -- if bit 3 is
set then the 9 following bits represent the number of bytes
in the header of the next chunk, unless the current chunk is
the last in a segment, in which case the next header will be
168 bytes in length... no thanks.  The month you spend
writing the software to read it, and the six months you
spend debugging, and the time everyone else working with
the format does the same, will never be paid back in
most cases.

It'll be interesting to see if "Efficient XML Interchange"
makes a difference here, though.

I think the bottom line is that badly-done XML projects
are bad, but a mediocre XML project is often better than
a mediocre or even good project with entirely custom formats.

Liam

-- 
Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin/
http://www.holoweb.net/~liam/ * http://www.fromoldbooks.org/

Follow-Ups:
- Re: [xml-dev] XML not ideal for Big Data
  - From: Kurt Cagle <kurt.cagle@gmail.com>

References:
- XML not ideal for Big Data
  - From: "Simon St.Laurent" <simonstl@simonstl.com>
