Thank you, Steven. That was the experiment I proposed a month or so ago,
and you have just shown very neatly that the entropy of the message
hasn't changed with the representation.

The only piece missing is to write out a file of a million 32-bit
integers (4 MB by definition) and see how much it compresses -- i.e., by
more than 50%? Then we really do have a lower bound on the entropy. I'm
choosing to ignore the compact formula/algorithmic representation at
this stage, because that's not a general solution.
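Something like the following would do it -- a quick Ruby sketch (Ruby to
match Steven's setup; the filenames are just placeholders):

  require 'zlib'

  # One million 32-bit integers, packed big-endian: exactly 4,000,000 bytes.
  raw = (1..1_000_000).to_a.pack('N*')
  File.open('ints32.dat', 'wb') { |f| f.write(raw) }

  Zlib::GzipWriter.open('ints32.dat.gz') { |gz| gz.write(raw) }

  orig = File.size('ints32.dat')
  gz   = File.size('ints32.dat.gz')
  printf("%d -> %d bytes (%.0f%% of original)\n", orig, gz, 100.0 * gz / orig)

For comparison, the text files in Steven's results below all gzip to
roughly 2.1-2.6 MB, so the interesting question is whether the 4 MB
binary file lands in the same neighbourhood.
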
Regards,
Rick
Steven J. DeRose wrote:
> At 16:31 +0000 2004-12-04, Michael Kay wrote:
>> At 10:42 -0500 2004-12-04, tedd wrote:
>>> In everything I have read, it appears that every chunk of content
>>> must be encapsulated by tags, such as:
>>>
>>> <data>123.456</data>
>>
>> These are both legitimate XML documents. The question is, what are you
>> trying to achieve? If you use XML markup around the data, then an XML
>> parser
>> (and tools such as XSLT) will understand it. If you use commas, then you
>> have to parse it yourself. If you want to parse the data by hand,
>> then why
>> use XML in the first place?
>
> Michael makes a good point -- how to do it depends on your goals.
>
> But to be extra clear, there is no real way to make XML *itself* aware
> of fields that are delimited only by commas (or some other single
> character delimiter). Such syntax was considered as a possibility but
> rejected. SGML can do this via the SHORTREF feature, if you're
> absolutely set on it.
>
> For cases like your example, where there is very little structure to
> demarcate, the overhead looks significant: a million copies of
> "<data></data>" versus "," adds up. However, consider:
>
> 1: If your data is "text files that are literally tens of thousands of
> characters in length", that is small enough that the overhead won't
> disturb most software running even on a cell phone. If we were talking
> many millions or billions of *records*, then this would be more of an
> issue (as it is for some users).
>
> 2: If you want the data formatted by CSS or XSL-FO, or transformed by
> XSLT, or whatever, having all the data in one syntax that the
> applications *already* know about is much easier than rewriting the
> applications or working around them to add some syntax (like commas)
> that they *don't* know about. You'll never have to debug the XML
> parser you use to parse all those "<data>" tags, but you will spend a
> lot of time if you try to introduce a new syntax in your process.
>
> 3: Any text file that contains zillions of instances of a certain
> string is necessarily very compressible. The first thing a
> compression program will do is discover that "<data>" is very common
> and assign it a really short code. A comma-delimited file is
> inherently less compressible.
>
> Here are some empirical results:
>
> I created a file with the numbers from one to a million, delimited in
> different ways. zero.dat has just a linefeed between numbers;
> comma.dat just has a comma and a linefeed; tag01 has a start and
> end-tag with the one-character element type "d" (and the linefeed);
> tag02 has element type "da", on up to tag20 which has a
> 20-character-long element type. 5-line Ruby program available on request.
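>
> In outline, it does something like this (the repeated-"d" element
> names here are illustrative, not the exact script):
>
>   # 999,999 lines per file reproduces the byte counts listed below.
>   File.open('zero.dat', 'w')  { |f| (1..999_999).each { |n| f.puts n } }
>   File.open('comma.dat', 'w') { |f| (1..999_999).each { |n| f.puts "#{n}," } }
>   [1, 2, 3, 4, 5, 10, 20].each do |len|
>     tag = 'd' * len    # "d", "dd", ... out to a 20-character element type
>     File.open(format('tag%02d.dat', len), 'w') do |f|
>       (1..999_999).each { |n| f.puts "<#{tag}>#{n}</#{tag}>" }
>     end
>   end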
>
> Here are the original sizes:
>
> 6888888 4 Dec 13:35 zero.dat
> 7888887 4 Dec 12:50 comma.dat
> 13888881 4 Dec 12:51 tag01.dat
> 15888879 4 Dec 12:52 tag02.dat
> 17888877 4 Dec 12:53 tag03.dat
> 19888875 4 Dec 12:56 tag04.dat
> 21888873 4 Dec 13:06 tag05.dat
> 31888863 4 Dec 13:07 tag10.dat
> 51888843 4 Dec 13:08 tag20.dat
>
> Here are the sizes after gzipping:
>
> 2129148 4 Dec 13:35 zero.dat.gz
> 2130082 4 Dec 12:50 comma.dat.gz
> 2377733 4 Dec 12:51 tag01.dat.gz
> 2376912 4 Dec 12:52 tag02.dat.gz
> 2518197 4 Dec 12:53 tag03.dat.gz
> 2638489 4 Dec 12:56 tag04.dat.gz
> 2631120 4 Dec 13:06 tag05.dat.gz
> 2661673 4 Dec 13:07 tag10.dat.gz
> 2596261 4 Dec 13:08 tag20.dat.gz
>
> You can see that:
>
> the linefeed-only file reduces to 2129148 / 6888888 = 31% of its
> original size
> the 20-char tagged file reduces to 2596261 / 51888843 = 5% of its
> original size
>
> And even though the 20-char tagged file was over 7.5 times bigger than
> the linefeed-only file when uncompressed, once they're compressed it
> is only about 1.2 times bigger -- a mere 22% increase despite every
> number having 2 tags with 20-character tag names, instead of nothing
> but a line break.
>
> I wouldn't worry about the extra bytes much. If you've got enough data
> for it to matter, buy a disk-compression utility and you can forget
> the issue.
>
> Steve DeRose