Even though CSV is much more efficient for distributing large data arrays,
you're certainly correct about the perils of CSV sans any metadata.
While the row and cell tags generated by Excel are a form of metadata -- in
that they ensure the data are parsed into the proper cells in a spreadsheet -- they
say nothing of what those data mean. One way is to add a "header row" in the
CSV defining what the data elements parsed into each spreadsheet column are
(i.e., a "field" definition a la relational databases).
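As a minimal sketch of that idea (the file contents and field names here are invented for illustration), Python's csv.DictReader treats the first row as exactly this kind of field definition:

```python
import csv
import io

# A CSV whose first line is a "header row" naming each field,
# much like column definitions in a relational table.
data = """patient_id,visit_date,score
101,2004-11-30,7
102,2004-12-01,4
"""

# DictReader uses the header row as keys, so each data row
# arrives labeled instead of as anonymous positional values.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["score"])  # the header row tells us what this number means
```

The header row adds only one line of overhead to the whole file, yet gives every column a name.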
But even that would not give you XML's hierarchical/associative
capabilities; to do so would require additional data.
I know of some innovative/proprietary ways to use CSVs and spreadsheets to
replicate the full array data capabilities of XML, including a way to manage
a wealth of meaningful metadata and formatting instructions, while keeping
the CSV data trimmed down to its streamlined essence (i.e., the ability to
send 17 million data elements in a 150KB file that is rapidly uncompressed
and parsed). But this is not a discussion for this forum. Anyone interested
can contact me off-list.
From: Bill Kearney [mailto:email@example.com]
Sent: Monday, December 06, 2004 3:56 PM
To: Stephen E. Beller; firstname.lastname@example.org
Subject: Re: [xml-dev] Data streams
This also speaks to the somewhat verbose form of XML that Office might be generating.
It's certainly no surprise to anyone that the data was larger and compressed
differently in XML than CSV. Especially not with the example you proposed.
I think your conclusion about CSV effectiveness is short-sighted. While CSV
can certainly be "bit stingy" it often comes at the considerable cost of
being brittle. Without effective metadata those numbers just become
gibberish. While it's fair to say an XML file may be larger it does so in a
remarkably self-documenting way. Where's the balance to be struck? In
lightweight CSV that's fraught with processing perils? Or in methodically
documented XML that simply takes a few cycles longer? CPU and disk are
cheap; the programming time and budget to work around crappy, brittle data
are not.
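To make the trade-off concrete (the element names, attributes, and values below are invented for illustration), here are the same two measurements as bare CSV and as self-describing XML:

```python
import xml.etree.ElementTree as ET

# The same two data points, first as bare CSV: compact, but gibberish
# without external documentation of what each position means.
csv_row = "98.6,120"

# ...and as XML that documents itself. Larger, but a reader
# (human or program) can tell what each number is and its units.
reading = ET.Element("reading")
ET.SubElement(reading, "temperature", units="F").text = "98.6"
ET.SubElement(reading, "systolic_bp", units="mmHg").text = "120"
xml_row = ET.tostring(reading, encoding="unicode")

print(len(csv_row), len(xml_row))  # the size cost of self-documentation
```

The XML version is an order of magnitude larger for these two values, which is the whole argument in miniature: bytes versus built-in meaning.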
It might be a more interesting experiment to discuss using more
purpose-built XML schemas, ones that do a better job of describing the data
with XML without being so verbose. While Office may not offer it at this
point that doesn't preclude others from doing a better job of it.
----- Original Message -----
From: "Stephen E. Beller" <email@example.com>
> I tried Steven's experiment from a different angle. I filled an Excel XP
> spreadsheet with a single-digit number, saved it in both XML and in a
> comma-delimited text file (CSV). I then compressed both with WinZip and
> opened both with Excel. Here's what I found:
> The XML file was 840MB, the CSV 34MB -- a 2,500% difference.
> Compressed, the XML file was 2.5MB, the CSV 0.15MB (150KB) -- a 1,670%
> difference.
> Equally dramatic is the time it took to uncompress and render the files as
> an Excel spreadsheet: It took about 20 minutes with the XML file; the CSV
> took 1 minute -- a 2,000% difference.
> My conclusion is that delimited text files handle large arrays of data
> efficiently. This stems, in part, from the fact that a comma delimiter (or
> some other single character) carries much less overhead than tags; CSV
> requires only a comma, while XML requires a minimum of 5 characters of
> markup -- that makes CSV a minimum of 500% more efficient ... and when you
> add semantic labels and attributes to the tags, the size of XML increases
> dramatically.
> Note, however, that when dealing with large blocks of text instead of
> numbers (or small text strings), the difference between XML and delimited
> text files is considerably less.
> Of course, XML offers benefits that a plain data array in a CSV file does
> not, such as attribute definitions and hierarchical associations between
> data (if that's necessary) ... even though there are ways comma-delimited
> data can be used to perform the same functions as XML when rendering
> serialized data arrays as charts.
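The per-value overhead arithmetic in the quoted message can be checked directly (the tag name and value below are invented; the 5-character figure counts only the angle-bracket punctuation, before tag names are added):

```python
# Overhead per value: CSV needs one delimiter character, while an XML
# element needs the markup punctuation "<", ">", "<", "/", ">" (5 chars)
# plus two copies of the tag name.
value = "7"
csv_cell = value + ","                # "7,"       -> 1 char of overhead
tag = "c"                             # shortest possible tag name
xml_cell = f"<{tag}>{value}</{tag}>"  # "<c>7</c>" -> 7 chars of overhead

print(len(csv_cell) - len(value))  # 1
print(len(xml_cell) - len(value))  # 7 once the tag names are counted
```

So even with a one-character tag name the real per-value overhead is 7 characters against CSV's 1, and it only grows with meaningful tag names and attributes.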