Thanks for this analysis, Steve. Much appreciated.
The oft-mentioned issues here are hardware-related: specifically,
large arrays of small sensors that communicate back to software.
The applications aren't hard to imagine. While one can argue
that XML is an inappropriate format for these devices, this is
a more severe case than that of cellphones and other portable
devices (which may or may not be battery-starved). They are asked to
use XML, and only then do the implications become apparent (design
by committee and RFP).
XML documents such as X3D are simply very large, and there are
datatype issues as well. GZIP and modem compression
proved insufficient over time. Also, some customers demand
a binary. They don't expose their content to inspection. Yes, it
can be cracked; no, they don't care. Rendering in real-time
systems is apparently also an issue, and no, this is not my area of
expertise, but slowing down a real-time framerate is the ultimate
sin in that world. In the case of X3D, a binary is just another
encoding in a standard that was built for three encodings and uses
the object model to keep these coherent. These customers want to
use web services with the distributed simulations (XMSF). FI is
an important contender in the organization responsible for that,
as Sun has been a faithful participant since the beginning of VRML
and now X3D. (Showing up is 90% of many efforts.)
len
From: Steven J. DeRose [mailto:sderose@acm.org]
This has been interesting.
A few points that I think are important:
1: What operations do you want to do?
Whether binary formats "are faster" depends on this more than anything else.
If you want to "generate the DSIG digital signature" it's hard to
imagine that a binary format could ever be faster, because the DSIG
is generally defined on the text stream -- so to correctly calculate
it, you'd have to crank through your binary structure and re-create
the text stream on the fly. That costs more time than simply scanning
a text stream that's already there.
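A rough Python sketch of that cost asymmetry, using a parsed
ElementTree tree as a stand-in for a binary store (this is not real
XML-DSIG canonicalization, and SHA-1 is just a placeholder digest;
the function names are made up for illustration):

  import hashlib
  import xml.etree.ElementTree as ET

  def digest_text(xml_bytes):
      # Text store: the byte stream is already the thing the signature
      # is defined over, so hashing it is a single pass.
      return hashlib.sha1(xml_bytes).hexdigest()

  def digest_from_tree(xml_bytes):
      # Tree/binary store: the serialization has to be regenerated
      # before it can be hashed -- an extra pass the text side skips.
      root = ET.fromstring(xml_bytes)
      regenerated = ET.tostring(root, encoding="utf-8")
      return hashlib.sha1(regenerated).hexdigest()

And the two digests generally won't even match, because the
regenerated stream rarely reproduces the original bytes exactly --
which previews the "lossless" discussion in point 3.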
On the other hand, if you want to "skip to the next section", a
halfway decent binary format will clean up. That's because it will
have a pointer and get there in one step, while the parser has to go
looking. Even if the looking is really quick, it's slower than not
looking at all.
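A toy contrast, with a hypothetical record layout on the binary side
and a deliberately naive scanner on the text side (it ignores
self-closing tags and tag names that merely begin with "sec"):

  import struct

  def next_sibling_binary(buf, pos):
      # Hypothetical layout: each node record starts with the 4-byte
      # big-endian offset of its next sibling.
      (offset,) = struct.unpack_from(">I", buf, pos)
      return offset                      # one step, no scanning

  def next_sibling_text(xml_text, pos, tag="sec"):
      # The text side walks forward, tracking nesting, until the
      # current element closes, then looks for the next start tag.
      open_t, close_t = "<" + tag, "</" + tag
      depth, i = 0, pos
      while i < len(xml_text):
          if xml_text.startswith(close_t, i):
              depth -= 1
              if depth == 0:
                  return xml_text.find(open_t, i + len(close_t))
          elif xml_text.startswith(open_t, i):
              depth += 1
          i += 1
      return -1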
If you want to examine the XPath "ancestors" axis, resolve IDREFs,
and so on, you'll be best off with a binary thing that knows all that
already because it figured it out when it was first generated. If you
want "preceding", all bets are off unless the implementor
specifically designed for it. If you want all occurrences of "the",
binary and text will be about the same, and you'd need a full-text
index to improve it.
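A sketch of the kind of thing a binary representation can "know
already" -- parent and ID maps built once at load time, so the
ancestors axis and IDREF resolution become lookups rather than
searches (this assumes plain "id" attributes, not DTD-declared ID
types):

  import xml.etree.ElementTree as ET

  def build_indexes(root):
      parent = {}    # child element -> parent element
      by_id = {}     # id value -> element
      for el in root.iter():
          for child in el:
              parent[child] = el
          if "id" in el.attrib:
              by_id[el.attrib["id"]] = el
      return parent, by_id

  def ancestors(el, parent):
      # Walk the precomputed parent map: lookups, not searches.
      while el in parent:
          el = parent[el]
          yield el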
This reminds me to mention that what someone said -- that it is
impossible to say which is faster because implementations differ --
is oversimplified. Of course an incompetent implementation of either
type can manage to be immeasurably bad. But some algorithms are
inherently faster than others; and binary representations have a
larger choice of algorithms.
2: Where is the data kept?
Often the biggest speed factor of all is what data is in RAM vs. disk
vs. over the net. RAM is about 10,000 times faster than disk (disk
seeks are very expensive). Binary formats, historically, are intended to
overcome this obstacle.
Jeff Vogel and I wrote the first binary SGML implementation that
could handle large documents, in early 1990. Our staff tweaked its
parser for XML later. Because we didn't have to touch the binary
representation at all, it would not be stretching much to say we had
a binary XML representation up and running in 1990. For the usual
operations required to search and render documents, nothing I've seen
yet has been faster for big documents. But it is uncommon these days
that single XML documents are too big to be kept in RAM.
The company was Electronic Book Technologies, the product was
DynaText, and it was mainly used for *really* big documents, like
F-16 manuals that on paper would outweigh the plane. Typical size for
a *single* document was 10-250MB. You could open a document that big,
go to anyplace determined by an XPath-like expression, render, and
have the text on the screen in about 1 second. If you want an
interesting contrast, make yourself a 1MB HTML file, open it in a
browser, scroll to the bottom, and then resize horizontally.
On the other hand, it was purely a delivery system, and you couldn't
update in place, although the binary format we used theoretically could.
There are at least 11 patents on it, so anyone can go see one way to
design a binary XML format that fast (though some of the cooler
tweaks were post-patent). One expects that any committees involved
would do that. Perhaps after 15 years they could do something
substantially better -- but we'll see.
3: What does "lossless" mean?
A few other recent postings have mentioned this issue. I think most
people would consider a format "lossless" if you could export from it
back into XML syntax, and when you parsed the resulting XML you got
the same DOM as for the original document. If that's enough, it's not
hard to make a lossless binary format (and mine was lossless, except
I think it discarded comments and PIs). HOWEVER, this is not
completely lossless. You still lose (among other things):
* the entity structure
* being able to get a matching DSIG
* all sorts of really ugly whitespace normalization details
(including within tags)
* single- versus double-quoting of attributes
* namespace prefix usage
* order of attributes
* <br /> versus <br></br>
So, until you define "lossless", there's no point in comparing
whether two products are lossless or not.
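A quick way to see the distinction: the two byte streams below differ
in quoting, attribute order, and <br /> versus <br></br>, yet parse
to equivalent trees. A format that only guarantees the first equality
silently loses everything on the list above. (A small Python check;
"shape" is a made-up helper name.)

  import xml.etree.ElementTree as ET

  a = b'<p class="x" id="1"><br /></p>'
  b = b"<p id='1' class='x'><br></br></p>"

  def shape(el):
      # Tag, attributes (as a dict, so order is ignored), text, children.
      return (el.tag, dict(el.attrib), el.text, [shape(c) for c in el])

  print(shape(ET.fromstring(a)) == shape(ET.fromstring(b)))   # True
  print(a == b)                                               # False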
Transportability also poses problems. For example, if you need to
move data from one system to another, you have to worry about any binary
numbers you store -- some systems store the high-order bytes first,
some store them last. We insisted on making our binaries readable
across platforms, and that involves a lot of byte-swapping overhead
that XML parsers never have to mess with.
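The byte-order issue in miniature (Python's struct here just stands
in for whatever record format a binary encoding would use):

  import struct

  offset = 0x00012345
  big    = struct.pack(">I", offset)    # bytes 00 01 23 45
  little = struct.pack("<I", offset)    # bytes 45 23 01 00

  # Reading a file written on a machine with the other byte order means
  # swapping before you can trust the number; plain text never cares.
  assert struct.unpack(">I", big)[0] == struct.unpack("<I", little)[0]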
4: Hybrid solutions
If you only need to optimize certain operations, you can do it within
XML: Make a pass over the file and add attributes as needed. In the
right setting, this could be really fast (though it's harder than it
looks):
<sec b:next-sibling-offset='99999' prev-sibling-offset='241'...>
Also, if you mainly need to optimize resolving IDREFs, just make a
separate index that says where they are, and leave the XML as is.
XLink works nicely for this.
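A sketch of that "separate index, leave the XML alone" idea, using
expat to record a byte offset for every element carrying an id
attribute (the attribute name and the index shape are assumptions,
not anyone's actual product):

  import xml.parsers.expat

  def build_id_index(xml_bytes):
      index = {}              # id value -> byte offset of the start tag
      parser = xml.parsers.expat.ParserCreate()

      def start(name, attrs):
          if "id" in attrs:   # assumes plain 'id' attributes
              index[attrs["id"]] = parser.CurrentByteIndex

      parser.StartElementHandler = start
      parser.Parse(xml_bytes, True)
      return index

  # Resolving an IDREF later is a dict lookup plus a seek into the
  # original, untouched file.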
What I'm saying overall is that the solution space is much wider than
it may appear, and the answers are more complex. Also, that it can
be, and has been, done successfully. But except for really huge
documents, I don't think it's usually worth the effort.
Steve