RE: Textual transmission typically faster?
- From: David Megginson <david@megginson.com>
- To: xml-dev@lists.xml.org
- Date: Thu, 18 Jan 2001 10:15:50 -0500 (EST)
Eugene Kuznetsov writes:
> That shouldn't be surprising -- generally speaking a parse tree is
> larger than the text input to the parser. (And java serialization
> doesn't win any awards for efficiency in either time or space).
>
> However, a binary representation of the same data (specified in
> ASN.1 and encoded using BER, say) would be much smaller and more
> efficient.
Consider
<foobar>data</foobar>
If we are using UTF-8, this will require 21 bytes.
In a non-navigable binary format (i.e. for one-pass reading, like XML
itself), we would have the "foobar" name in a name table and would end
up with something like this:
1 byte: type code for start element
4 bytes: pointer to name "foobar" in table
2 bytes: number of attributes (0)
1 byte: type code for text
4 bytes: data length (4)
4 bytes: data
1 byte: type code for end element
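For concreteness, here's a rough Java sketch of writing that record with DataOutputStream (the type-code values are made up for illustration, and the name-table offset is assumed to be a 4-byte big-endian int as in the layout above):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class OnePassEncoder {
    // Type codes are invented for illustration; any one-byte values would do.
    static final byte START_ELEMENT = 0x01;
    static final byte TEXT          = 0x02;
    static final byte END_ELEMENT   = 0x03;

    // Encode one element with text content, assuming its name is already
    // interned in a name table at nameOffset.
    static byte[] encode(int nameOffset, byte[] text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(START_ELEMENT); // 1 byte: type code for start element
        out.writeInt(nameOffset);     // 4 bytes: pointer to name in table
        out.writeShort(0);            // 2 bytes: number of attributes (0)
        out.writeByte(TEXT);          // 1 byte: type code for text
        out.writeInt(text.length);    // 4 bytes: data length
        out.write(text);              // the data itself
        out.writeByte(END_ELEMENT);   // 1 byte: type code for end element
        out.flush();
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] record = encode(0, "data".getBytes(StandardCharsets.UTF_8));
        System.out.println(record.length); // prints 17, vs. 21 bytes as clear text
    }
}
```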
Using UTF-8 (again) this gives us 17 bytes, a slight improvement. If
the amount of text were increased to only say, 20 bytes, then the
ratio would be 37:33 rather than 21:17 (not that exciting at all), and
that's pre-compression. If you knew for sure that none of your
documents would be very long, you could shorten some of the pointers
to 3 or even 2 bytes, but then your format wouldn't be generally
usable for XML processing.
In a navigable binary format, things become much worse, because the
binary format needs left/right sibling and parent/first-child
pointers, so we end up with something like this:
1 byte: type code for start element
4 bytes: pointer to name in table
4 bytes: pointer to left sibling
4 bytes: pointer to right sibling
4 bytes: pointer to parent
4 bytes: pointer to first child
2 bytes: number of attributes (0)
1 byte: type code for text
4 bytes: data length (4)
4 bytes: data
1 byte: type code for end element
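The arithmetic is easy to replay; here's a throwaway tally of the three layouts, with the field sizes copied straight from the lists above:

```java
public class SizeTally {
    // Clear text: "<foobar>" (8 bytes) + data + "</foobar>" (9 bytes)
    static int clearText(int textLen) { return 8 + textLen + 9; }

    // One-pass layout: type + name ptr + attr count + type + length + data + type
    static int onePass(int textLen)   { return 1 + 4 + 2 + 1 + 4 + textLen + 1; }

    // Navigable layout adds four 4-byte navigation pointers
    // (left/right sibling, parent, first child)
    static int navigable(int textLen) { return onePass(textLen) + 4 * 4; }

    public static void main(String[] args) {
        System.out.println(clearText(4)  + ":" + onePass(4));    // 21:17
        System.out.println(clearText(20) + ":" + onePass(20));   // 37:33
        System.out.println(clearText(4)  + ":" + navigable(4));  // 21:33
    }
}
```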
That gives us a total of 33 bytes (assuming no navigation pointers at
the end of the element) in binary format vs. 21 bytes in clear text,
and that's before compression. There would be significant advantages
to such a binary format, though, since you could create a DOM
interface that could navigate large documents without loading them
completely into memory first. Just don't count on file-system space
savings.
As for time efficiency, that depends on how much time people put into
writing the loaders.  James Clark's C-based Expat already parses XML
at pretty much I/O speed, so you'd be hard pressed to beat it.  The
Java-based XML parsers aren't quite as fast, but they're still pretty
impressive (there was a bit of friendly rivalry a couple of years ago
over benchmarks, and several of the main XML parsers were optimized
almost beyond belief).
Even if you could write a one-pass binary-format parser that could
run, say, 20% faster than the best XML parsers in the same language,
that advantage would make almost no difference to total application
time, since, in my experience, actual XML parsing (as opposed to
building object trees, DOM or otherwise) accounts for well under 5% of
execution time for even the simplest real-world applications. A 20%
improvement in parsing time would give you at most a 1% improvement in
application execution time. If the document is coming over the net
rather than from a local drive, even that small advantage will
probably be lost in network latency.
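Back of the envelope, that arithmetic looks like this (using the <=5% parsing share and 20% parser speedup assumed above):

```java
public class ParseSpeedup {
    // Fraction of total application time saved when a phase that accounts
    // for `share` of the total runs `speedup` faster.
    static double overallSaving(double share, double speedup) {
        return share * speedup;
    }

    public static void main(String[] args) {
        // Parsing is at most 5% of execution time; the parser is 20% faster.
        double saving = overallSaving(0.05, 0.20);
        System.out.printf("%.1f%%%n", saving * 100); // prints 1.0%
    }
}
```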
In summary, then, there are some good arguments for writing a
navigable XML-like binary format to allow large documents to be
processed with a DOM without loading the whole document into memory.
I can see very little argument for using a binary format simply for
space or efficiency.
All the best,
David
--
David Megginson david@megginson.com
http://www.megginson.com/