Thank you, Steven. That was the experiment I proposed a month or so ago,
and you have just shown very neatly that the entropy of the message
hasn't changed with the representation.

The only piece missing is to write out a file of a million 32-bit
integers (4 MB by definition) and see how much it compresses -- i.e., by
more than 50%? Then we really do have a lower bound on the entropy. I'm
choosing to ignore the compact formula/algorithmic representation at
this stage, because that's not a general solution.
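Something like the following would do it -- a quick Ruby sketch (Ruby to
match Steven's setup; the filenames are just placeholders):

  require 'zlib'

  # One million 32-bit integers, packed big-endian: exactly 4,000,000 bytes.
  raw = (1..1_000_000).to_a.pack('N*')
  File.open('ints32.dat', 'wb') { |f| f.write(raw) }

  Zlib::GzipWriter.open('ints32.dat.gz') { |gz| gz.write(raw) }

  orig = File.size('ints32.dat')
  gz   = File.size('ints32.dat.gz')
  printf("%d -> %d bytes (%.0f%% of original)\n", orig, gz, 100.0 * gz / orig)

For comparison, the text files in Steven's results below all gzip to
roughly 2.1-2.6 MB, so the interesting question is whether the 4 MB
binary file lands in the same neighbourhood.
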
Regards,
Rick
Steven J. DeRose wrote:
> At 16:31 +0000 2004-12-04, Michael Kay wrote:
>> At 10:42 -0500 2004-12-04, tedd wrote:
>>> In everything I have read, it appears that every chunk of content
>>> must be encapsulated by tags, such as:
>>>
>>> <data>123.456</data>
>>
>> These are both legitimate XML documents. The question is, what are you
>> trying to achieve? If you use XML markup around the data, then an XML
>> parser
>> (and tools such as XSLT) will understand it. If you use commas, then you
>> have to parse it yourself. If you want to parse the data by hand,
>> then why
>> use XML in the first place?
>
> Michael makes a good point -- how to do it depends on your goals.
>
> But to be extra clear, there is no real way to make XML *itself* aware
> of fields that are delimited only by commas (or some other single
> character delimiter). Such syntax was considered as a possibility but
> rejected. SGML can do this via the SHORTREF feature, if you're
> absolutely set on it.
>
> For cases like your example, where there is very little structure to
> demarcate, the overhead looks significant: a million copies of
> "<data></data>" versus "," adds up. However, consider:
>
> 1: If your data is "text files that are literally tens of thousands of
> characters in length", that is small enough that the overhead won't
> disturb most software running even on a cell phone. If we were talking
> many millions or billions of *records*, then this would be more of an
> issue (as it is for some users).
>
> 2: If you want the data formatted by CSS or XSL-FO, or transformed by
> XSLT, or whatever, having all the data in one syntax that the
> applications *already* know about is much easier than rewriting the
> applications or working around them to add some syntax (like commas)
> that they *don't* know about. You'll never have to debug the XML
> parser you use to parse all those "<data>" tags, but you will spend a
> lot of time if you try to introduce a new syntax in your process.
>
> 3: Any text file that contains zillions of instances of a certain
> string is necessarily very compressible. The first thing a
> compression program will do is discover that "<data>" is very common
> and assign it a really short code. A comma-delimited file is
> inherently less compressible.
>
> Here are some empirical results:
>
> I created a file with the numbers from one to a million, delimited in
> different ways. zero.dat has just a linefeed between numbers;
> comma.dat just has a comma and a linefeed; tag01 has a start and
> end-tag with the one-character element type "d" (and the linefeed);
> tag02 has element type "da", on up to tag20 which has a
> 20-character-long element type. 5-line Ruby program available on request.
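>
> In outline, it does something like this (the repeated-"d" element
> names here are illustrative, not the exact script):
>
>   # 999,999 lines per file reproduces the byte counts listed below.
>   File.open('zero.dat', 'w')  { |f| (1..999_999).each { |n| f.puts n } }
>   File.open('comma.dat', 'w') { |f| (1..999_999).each { |n| f.puts "#{n}," } }
>   [1, 2, 3, 4, 5, 10, 20].each do |len|
>     tag = 'd' * len    # "d", "dd", ... out to a 20-character element type
>     File.open(format('tag%02d.dat', len), 'w') do |f|
>       (1..999_999).each { |n| f.puts "<#{tag}>#{n}</#{tag}>" }
>     end
>   end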
>
> Here are the original sizes:
>
> 6888888 4 Dec 13:35 zero.dat
> 7888887 4 Dec 12:50 comma.dat
> 13888881 4 Dec 12:51 tag01.dat
> 15888879 4 Dec 12:52 tag02.dat
> 17888877 4 Dec 12:53 tag03.dat
> 19888875 4 Dec 12:56 tag04.dat
> 21888873 4 Dec 13:06 tag05.dat
> 31888863 4 Dec 13:07 tag10.dat
> 51888843 4 Dec 13:08 tag20.dat
>
> Here are the sizes after gzipping:
>
> 2129148 4 Dec 13:35 zero.dat.gz
> 2130082 4 Dec 12:50 comma.dat.gz
> 2377733 4 Dec 12:51 tag01.dat.gz
> 2376912 4 Dec 12:52 tag02.dat.gz
> 2518197 4 Dec 12:53 tag03.dat.gz
> 2638489 4 Dec 12:56 tag04.dat.gz
> 2631120 4 Dec 13:06 tag05.dat.gz
> 2661673 4 Dec 13:07 tag10.dat.gz
> 2596261 4 Dec 13:08 tag20.dat.gz
>
> You can see that:
>
> the linefeed-only file reduces to 2129148 / 6888888 = 31% of its
> original size
> the 20-char tagged file reduces to 2596261 / 51888843 = 5% of its
> original size
>
> And even though the 20-char tagged file was over 7.5 times bigger than
> the linefeed-only file when uncompressed, once they're compressed it
> is only about 1.2 times bigger -- a mere 22% increase despite every
> number having 2 tags with 20-character tag names, instead of nothing
> but a line break.
>
> I wouldn't worry about the extra bytes much. If you've got enough data
> for it to matter, buy a disk-compression utility and you can forget
> the issue.
>
> Steve DeRose