Title: RE: [xml-dev] Data streams
Steve:
A most excellent point and presentation. I'm aware of compression
schemes, but it had not occurred to me to consider them in the final
analysis. I often think of compression in terms of data storage rather
than data transmission.
So the bottom line here is that using <data>1234.5678</data> for
thousands of samples is perfectly legal, accepted, and recommended,
and is an efficient method of storing and retrieving data.
Old habits die hard -- I'm just from the old school where we had
to worry about the size of things, and it's not in my nature NOT to
consider those things. Of course, while size is still a consideration,
it's becoming less of one. I guess the last thing we want to do is to
be shortsighted as to future needs based upon today's perceived, but
temporary and lessening, limitations (i.e., seven-bit ASCII).
Many thanks to all for your most generous time and excellent
advice.
tedd
---
At 16:31 +0000 2004-12-04, Michael Kay wrote:
At 10:42 -0500 2004-12-04, tedd wrote:
> In everything I have read, it appears that every chunk of content
> must be encapsulated by tags, such as:
>
> <data>123.456</data>
These are both legitimate XML documents. The question is, what are you
trying to achieve? If you use XML markup around the data, then an XML
parser (and tools such as XSLT) will understand it. If you use commas,
then you have to parse it yourself. If you want to parse the data by
hand, then why use XML in the first place?
Michael makes a good point -- how to do it depends on your goals.
But to be extra clear, there is no real way to make XML *itself* aware
of fields that are delimited only by commas (or some other single
character delimiter). Such syntax was considered as a possibility but
rejected. SGML can do this via the SHORTREF feature, if you're
absolutely set on it.
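To make that difference concrete, here is a minimal Python sketch (not from the original thread) using the standard library's xml.etree.ElementTree. The <samples> wrapper element and the second sample value are made up for illustration; only the parsing behaviour is the point.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the <samples> wrapper and the values are
# illustrative only.
tagged = "<samples><data>123.456</data><data>789.012</data></samples>"
delimited = "<samples>123.456,789.012</samples>"

# With markup, the parser hands back one element per value.
tagged_root = ET.fromstring(tagged)
tagged_values = [float(e.text) for e in tagged_root.iter("data")]

# With commas, the parser sees a single run of text and no child
# elements; splitting it into fields is the application's job.
delimited_root = ET.fromstring(delimited)
delimited_values = [float(s) for s in delimited_root.text.split(",")]

print(len(tagged_root), tagged_values)
print(len(delimited_root), delimited_values)
```

Both documents are well-formed, but only the first one exposes the individual values to XML tools; the split(",") step in the second case is the hand-written parsing Michael refers to.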
For cases like your example, where there is very little structure to
demarcate, the markup overhead seems important: a million copies of
"<data></data>" versus "," adds up.
However, consider:
1: if your data is "text files that are literally tens of
thousands of characters in length", that is small enough that the
overhead won't disturb most software running even on a cell phone. If
we were talking many millions or billions of *records*, then this
would be more of an issue (as it is for some users).
2: If you want the data formatted by CSS or XSL-FO, or transformed by
XSLT, or whatever, having all the data in one syntax that the
applications *already* know about is much easier than rewriting the
applications or working around them to add some syntax (like commas)
that they *don't* know about. You'll never have to debug the XML
parser you use to parse all those "<data>" tags, but
you will spend a lot of time if you try to introduce a new syntax in
your process.
3: Any text file that contains zillions of instances of a certain
string is necessarily very compressible. The first thing a
compression program will do is discover that "<data>"
is real common, and assign it a really short code. A comma-delimited
file is inherently less compressible.
Here are some empirical results:
I created a file with the numbers from one to a million, delimited in
different ways. zero.dat has just a linefeed between numbers;
comma.dat just has a comma and a linefeed; tag01 has a start and
end-tag with the one-character element type "d" (and the
linefeed); tag02 has element type "da", on up to tag20 which
has a 20-character-long element type. 5-line Ruby program available on
request.
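The Ruby program itself is not included in the post, so what follows is only a rough Python sketch of how such files could be generated, not the original. Two things in it are assumptions: the element type names beyond "d" and "da" (here "data" repeated and cut to length), and the exact range. The uncompressed byte counts in the post are consistent with the numbers 1 through 999,999, one per line. Compressed sizes depend on the gzip implementation, its settings, and the element names, so they are not expected to match byte for byte.

```python
import gzip


def render(numbers, open_text="", close_text=""):
    """One number per line, wrapped in the given delimiters, as bytes."""
    lines = (f"{open_text}{n}{close_text}\n" for n in numbers)
    return "".join(lines).encode("ascii")


def variants(numbers):
    """Yield (name, contents) for the zero, comma, and tagNN files."""
    numbers = list(numbers)
    yield "zero", render(numbers)
    yield "comma", render(numbers, close_text=",")
    for length in (1, 2, 3, 4, 5, 10, 20):
        # Assumption: names are "data" repeated and cut to length
        # ("d", "da", "dat", ...); the post only gives "d" and "da".
        name = ("data" * 5)[:length]
        yield f"tag{length:02d}", render(numbers, f"<{name}>", f"</{name}>")


def report(limit):
    """Print raw and gzipped sizes for the numbers 1 .. limit - 1."""
    for name, contents in variants(range(1, limit)):
        packed = gzip.compress(contents)
        print(f"{name}.dat {len(contents):>10} -> {len(packed):>9}")


if __name__ == "__main__":
    # Kept small so the sketch runs quickly; report(1_000_000) gives
    # the uncompressed sizes listed in the post.
    report(10_000)
```

The sketch builds each file in memory rather than writing it to disk, which keeps it short; writing the bytes out with open(path, "wb") would be the obvious variation.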
Here are the original sizes:
6888888 4 Dec 13:35 zero.dat
7888887 4 Dec 12:50 comma.dat
13888881 4 Dec 12:51 tag01.dat
15888879 4 Dec 12:52 tag02.dat
17888877 4 Dec 12:53 tag03.dat
19888875 4 Dec 12:56 tag04.dat
21888873 4 Dec 13:06 tag05.dat
31888863 4 Dec 13:07 tag10.dat
51888843 4 Dec 13:08 tag20.dat
Here are the sizes after gzipping:
2129148 4 Dec 13:35 zero.dat.gz
2130082 4 Dec 12:50 comma.dat.gz
2377733 4 Dec 12:51 tag01.dat.gz
2376912 4 Dec 12:52 tag02.dat.gz
2518197 4 Dec 12:53 tag03.dat.gz
2638489 4 Dec 12:56 tag04.dat.gz
2631120 4 Dec 13:06 tag05.dat.gz
2661673 4 Dec 13:07 tag10.dat.gz
2596261 4 Dec 13:08 tag20.dat.gz
You can see that:
the linefeed-only file reduces to 2129148 / 6888888 = 31% of its
original size
the 20-char tagged file reduces to 2596261 / 51888843 = 5% of
its original size
And even though the 20-char tagged file was over 7.5 times bigger than
the linefeed-only file when uncompressed, once they're compressed it
is only about 1.2 times bigger -- a mere 22% increase despite every
number having 2 tags with 20-character tag names, instead of nothing
but a line break.
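For anyone who wants to check the arithmetic, here is a short Python sketch that recomputes those ratios from the zero.dat and tag20.dat entries in the two listings above. The variable names are just illustrative.

```python
# Sizes in bytes, copied from the two listings above.
raw = {"zero": 6888888, "tag20": 51888843}
gz = {"zero": 2129148, "tag20": 2596261}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


zero_kept = percent(gz["zero"], raw["zero"])      # compressed / original
tag20_kept = percent(gz["tag20"], raw["tag20"])
raw_ratio = raw["tag20"] / raw["zero"]            # tagged vs. plain, raw
gz_ratio = gz["tag20"] / gz["zero"]               # tagged vs. plain, gzipped

print(f"zero.dat compresses to {zero_kept:.1f}% of its original size")
print(f"tag20.dat compresses to {tag20_kept:.1f}% of its original size")
print(f"uncompressed, tag20 is {raw_ratio:.2f} times the size of zero")
print(f"compressed, tag20 is {gz_ratio:.2f} times the size of zero")
```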
I wouldn't worry about the extra bytes much. If you've got enough data
for it to matter, buy a disk-compression utility and you can forget
the issue.
Steve DeRose
--
--------------------------------------------------------------------------------
http://sperling.com/