This has been interesting.
A few points that I think are important:
1: What operations do you want to do?
Whether binary formats "are faster" depends on this more than anything else.
If you want to "generate the DSIG digital signature" it's hard to
imagine that a binary format could ever be faster, because the DSIG
is generally defined on the text stream -- so to correctly calculate
it, you'd have to crank through your binary structure and re-create
the text stream on the fly. That costs more time than simply scanning
a text stream that's already there.
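To make that concrete, here is a minimal Python sketch, with a plain
SHA-256 digest standing in for the much more involved DSIG machinery;
the toy document and the tuple-based "binary structure" are invented
for illustration:

import hashlib

# Toy illustration: a digest over the exact text stream stands in for
# the (much more involved) XML DSIG computation.
xml_text = b"<doc><sec>Hello</sec></doc>"

# With the text already in hand, one pass over the bytes is all it takes.
digest_from_text = hashlib.sha256(xml_text).hexdigest()

# A binary store has to re-create that exact byte stream first -- element
# names, whitespace and all -- before it can hash anything.
def serialize(node):                     # stand-in for walking a binary structure
    name, children, text = node
    inner = text or "".join(serialize(c) for c in children)
    return "<%s>%s</%s>" % (name, inner, name)

binary_doc = ("doc", [("sec", [], "Hello")], None)
digest_from_binary = hashlib.sha256(serialize(binary_doc).encode()).hexdigest()

assert digest_from_text == digest_from_binary   # same answer, extra work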
On the other hand, if you want to "skip to the next section" a
halfway decent binary format will clean up. That's because it will
have a pointer and get there in one step, while the parser has to go
looking. Even if the looking is really quick, it's slower than not
looking at all.
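A minimal sketch of the difference, in Python; the offsets, tag names,
and toy document are invented for illustration:

import re

xml_text = "<doc><sec>" + "x" * 1000 + "</sec><sec>target</sec></doc>"

# Text side: the parser has to scan forward until the next <sec> turns up.
def next_sec_by_scanning(text, after):
    m = re.compile("<sec>").search(text, after)
    return m.start() if m else -1

# Binary side: the offset was computed once, when the format was built,
# so "skip to the next section" is a single table lookup.
next_sibling_offset = {5: 1016}          # first <sec> starts at 5, second at 1016

assert next_sec_by_scanning(xml_text, 10) == 1016
assert next_sibling_offset[5] == 1016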
If you want to examine the XPath "ancestor" axis, resolve IDREFs,
and so on, you'll be best off with a binary thing that knows all that
already because it figured it out when it was first generated. If you
want "preceding", all bets are off unless the implementor
specifically designed for it. If you want all occurrences of "the",
binary and text will be about the same, and you'd need a full-text
index to improve on either.
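Here is a rough Python sketch of what "knows all that already" means:
parent pointers and an ID table built once, up front, so the ancestor
axis and IDREF resolution become simple lookups. The little test
document is made up, and this is only the general idea, not any
particular product's design:

import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<doc><chap id="c1"><sec id="s1"><p>See <ref idref="c1"/></p></sec></chap></doc>')

parent, by_id = {}, {}                   # built once, when the binary form is made
for el in doc.iter():
    if el.get("id"):
        by_id[el.get("id")] = el
    for child in el:
        parent[child] = el

def ancestors(el):                       # the XPath ancestor axis, no searching
    while el in parent:
        el = parent[el]
        yield el

ref = doc.find(".//ref")
assert by_id[ref.get("idref")].tag == "chap"              # IDREF: one lookup
assert [a.tag for a in ancestors(ref)] == ["p", "sec", "chap", "doc"]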
This reminds me to mention that the claim someone made, that it's
impossible to say which is faster because implementations differ, is
oversimplified. Of course an incompetent implementation of either
type can manage to be immeasurably bad. But some algorithms are
inherently faster than others; and binary representations have a
larger choice of algorithms.
2: Where is the data kept?
Often the biggest speed factor of all is what data is in RAM vs. disk
vs. over the net. RAM is about 10,000 times faster than disk (disk
seeks are very expensive). Binary formats, historically, are intended
to overcome this obstacle.
Jeff Vogel and I wrote the first binary SGML implementation that
could handle large documents, in early 1990. Our staff tweaked its
parser for XML later. Because we didn't have to touch the binary
representation at all, it would not be stretching much to say we had
a binary XML representation up and running in 1990. For the usual
operations required to search and render documents, nothing I've seen
yet has been faster for big documents. But it is uncommon these days
that single XML documents are too big to be kept in RAM.
The company was Electronic Book Technologies, the product was
DynaText, and it was mainly used for *really* big documents, like
F-16 manuals that on paper would outweigh the plane. Typical size for
a *single* document was 10-250MB. You could open a document that big,
go to anyplace determined by an XPath-like expression, render, and
have the text on the screen in about 1 second. If you want an
interesting contrast, make yourself a 1MB HTML file, open it in a
browser, scroll to the bottom, and then resize horizontally.
On the other hand, it was purely a delivery system, and you couldn't
update in place, although the binary format it used theoretically
could have supported that.
There are at least 11 patents on it, so anyone can go see one way to
design a binary XML format that fast (though some of the cooler
tweaks were post-patent). One expects that any committees involved
would do that. Perhaps after 15 years they could do something
substantially better -- but we'll see.
3: What does "lossless" mean?
A few other recent postings have mentioned this issue. I think most
people would consider a format "lossless" if you could export from it
back into XML syntax, and when you parsed the resulting XML you got
the same DOM as for the original document. If that's enough, it's not
hard to make a lossless binary format (and mine was lossless, except
I think it discarded comments and PIs). HOWEVER, this is not
completely lossless. You still lose (among other things):
* the entity structure
* being able to get a matching DSIG
* all sorts of really ugly whitespace normalization details
(including within tags)
* single- versus double-quoting of attributes
* namespace prefix usage
* order of attributes
* <br /> versus <br></br>
So, until you define "lossless", there's no point in comparing
whether two products are lossless or not.
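Here is a small Python illustration of the distinction: two documents
that differ at the byte level in several of the ways just listed, yet
come out of the parser as the same thing. The tiny documents and the
urn:x namespace are made up:

import xml.etree.ElementTree as ET

a = '<p:doc xmlns:p="urn:x" a="1" b="2"><br/></p:doc>'
b = "<q:doc  xmlns:q='urn:x' b='2' a='1'><br></br></q:doc>"

ta, tb = ET.fromstring(a), ET.fromstring(b)
assert ta.tag == tb.tag == "{urn:x}doc"       # the p and q prefixes both vanish
assert ta.attrib == tb.attrib                 # attribute order and quoting too
assert [c.tag for c in ta] == [c.tag for c in tb] == ["br"]
# "Lossless" in the DOM sense -- yet a DSIG over the raw bytes of a and b
# will never match.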
Transportability also poses problems. For example, if you mean to
move from one system to another, you have to worry about any binary
numbers you store -- some systems store the high-order bytes first,
some store them last. We insisted on making our binaries readable
across platforms, and that involves a lot of byte-swapping overhead
that XML parsers never have to mess with.
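A minimal sketch of that concern, in Python: every stored number gets
written in one agreed byte order (big-endian here), no matter what the
writing machine prefers. The offset value is invented:

import struct, sys

offset = 123456                          # some byte offset stored in the binary file

portable = struct.pack(">I", offset)     # always high-order byte first
native   = struct.pack("=I", offset)     # whatever this CPU happens to use

assert struct.unpack(">I", portable)[0] == offset   # true on every platform
print(sys.byteorder, portable.hex(), native.hex())
# On a little-endian machine: little 0001e240 40e20100 -- hence the swapping.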
4: Hybrid solutions
If you only need to optimize certain operations, you can do it within
XML: Make a pass over the file and add attributes as needed. In the
right setting, this could be really fast (though it's harder than it
looks):
<sec b:next-sibling-offset='99999' prev-sibling-offset='241'...>
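The "harder than it looks" part is that the offsets describe the very
file the pass is writing, so adding the attributes moves everything.
One dodge -- a rough sketch only, not how any particular product does
it, with invented section content -- is to reserve fixed-width
placeholders and patch them afterwards:

sections = ["<p>one</p>", "<p>two two</p>", "<p>three</p>"]
WIDTH = 10                               # every offset zero-padded to 10 digits

# Pass 1: emit each <sec> with a placeholder and remember where it starts.
out, starts, pos = [], [], 0
for body in sections:
    starts.append(pos)
    chunk = "<sec next-sibling-offset='%s'>%s</sec>" % ("0" * WIDTH, body)
    out.append(chunk)
    pos += len(chunk)
text = "".join(out)

# Pass 2: patch in the real offsets; the width never changes, so nothing moves.
buf = bytearray(text.encode())
for i, start in enumerate(starts[:-1]):                 # last <sec> keeps zeros
    field = str(starts[i + 1]).zfill(WIDTH).encode()
    at = text.index("'", start) + 1                     # just past the opening quote
    buf[at:at + WIDTH] = field
print(buf.decode())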
Also, if you mainly need to optimize resolving IDREFs, just make a
separate index that says where they are, and leave the XML as is.
XLink works nicely for this.
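Roughly, something like this Python sketch; the id="..." regex is only
good enough for illustration, and the file and ID names are made up:

import re

def build_id_index(path):
    """One pass over the raw bytes: ID value -> byte offset. The XML is untouched."""
    index = {}
    with open(path, "rb") as f:
        data = f.read()
    for m in re.finditer(rb'\bid="([^"]+)"', data):
        index[m.group(1).decode()] = m.start()
    return index

# Resolving an IDREF is then a dictionary lookup plus one seek:
#   offsets = build_id_index("manual.xml")
#   f.seek(offsets["chapter-12"])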
What I'm saying overall is that the solution space is much wider than
it may appear, and the answers are more complex. Also, that it can
be, and has been, done successfully. But except for really huge
documents, I don't think it's usually worth the effort.
Steve
--
Luthien Consulting: Real solutions to hard information management problems
Specializing in XML, schema design, XSLT, and project design/review/repair
Steven J. DeRose, Ph.D., sderose@acm.org