Thanks for this analysis, Steve. Much appreciated.
The oft-mentioned issues here are hardware-related: specifically,
large arrays of small sensors that communicate back to software.
The applications aren't hard to imagine. While one can argue
that XML is an inappropriate format for these devices, this is
a more severe case than that of cellphones and other portable
devices (which may or may not be battery-starved). They are asked to
use XML, and only then do the implications become apparent (design
by committee and RFP).
XML documents such as X3D are simply very large, and there are
datatype issues as well. GZIP and modem compression
proved insufficient over time. Also, some customers demand
a binary. They don't expose their content to inspection. Yes, it
can be cracked; no, they don't care. Rendering in real-time
systems is apparently also an issue, and no, this is not my area of
expertise, but slowing down a real-time framerate is the ultimate
sin in that world. In the case of X3D, a binary is just another
encoding in a standard that was built for three encodings and uses
the object model to keep these coherent. These customers want to
use web services with the distributed simulations (XMSF). FI is
an important contender in the organization responsible for that,
as Sun has been a faithful participant since the beginning of VRML
and now X3D. (Showing up is 90% of many efforts.)
len
From: Steven J. DeRose [mailto:sderose@acm.org]
This has been interesting.
A few points that I think are important:
1: What operations do you want to do?
Whether binary formats "are faster" depends on this more than anything else.
If you want to "generate the DSIG digital signature" it's hard to
imagine that a binary format could ever be faster, because the DSIG
is generally defined on the text stream -- so to correctly calculate
it, you'd have to crank through your binary structure and re-create
the text stream on the fly. That costs more time than simply scanning
a text stream that's already there.
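A rough Python sketch of that cost asymmetry, using a parsed
ElementTree tree as a stand-in for a binary store (this is not real
XML-DSIG canonicalization, and SHA-1 is just a placeholder digest;
the function names are made up for illustration):

  import hashlib
  import xml.etree.ElementTree as ET

  def digest_text(xml_bytes):
      # Text store: the byte stream is already the thing the signature
      # is defined over, so hashing it is a single pass.
      return hashlib.sha1(xml_bytes).hexdigest()

  def digest_from_tree(xml_bytes):
      # Tree/binary store: the serialization has to be regenerated
      # before it can be hashed -- an extra pass the text side skips.
      root = ET.fromstring(xml_bytes)
      regenerated = ET.tostring(root, encoding="utf-8")
      return hashlib.sha1(regenerated).hexdigest()

And the two digests generally won't even match, because the
regenerated stream rarely reproduces the original bytes exactly --
which previews the "lossless" discussion in point 3.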
On the other hand, if you want to "skip to the next section", a
halfway decent binary format will clean up. That's because it will
have a pointer and get there in one step, while the parser has to go
looking. Even if the looking is really quick, it's slower than not
looking at all.
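A toy contrast, with a hypothetical record layout on the binary side
and a deliberately naive scanner on the text side (it ignores
self-closing tags and tag names that merely begin with "sec"):

  import struct

  def next_sibling_binary(buf, pos):
      # Hypothetical layout: each node record starts with the 4-byte
      # big-endian offset of its next sibling.
      (offset,) = struct.unpack_from(">I", buf, pos)
      return offset                      # one step, no scanning

  def next_sibling_text(xml_text, pos, tag="sec"):
      # The text side walks forward, tracking nesting, until the
      # current element closes, then looks for the next start tag.
      open_t, close_t = "<" + tag, "</" + tag
      depth, i = 0, pos
      while i < len(xml_text):
          if xml_text.startswith(close_t, i):
              depth -= 1
              if depth == 0:
                  return xml_text.find(open_t, i + len(close_t))
          elif xml_text.startswith(open_t, i):
              depth += 1
          i += 1
      return -1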
If you want to examine the XPath "ancestors" axis, resolve IDREFs,
and so on, you'll be best off with a binary thing that knows all that
already because it figured it out when it was first generated. If you
want "preceding", all bets are off unless the implementor
specifically designed for it. If you want all occurrences of "the",
binary and text will be about the same, and you'd need a full-text
index to improve it.
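A sketch of the kind of thing a binary representation can "know
already" -- parent and ID maps built once at load time, so the
ancestors axis and IDREF resolution become lookups rather than
searches (this assumes plain "id" attributes, not DTD-declared ID
types):

  import xml.etree.ElementTree as ET

  def build_indexes(root):
      parent = {}    # child element -> parent element
      by_id = {}     # id value -> element
      for el in root.iter():
          for child in el:
              parent[child] = el
          if "id" in el.attrib:
              by_id[el.attrib["id"]] = el
      return parent, by_id

  def ancestors(el, parent):
      # Walk the precomputed parent map: lookups, not searches.
      while el in parent:
          el = parent[el]
          yield el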
This reminds me to mention that what someone said -- that it is
impossible to say which is faster because implementations differ --
is oversimplified. Of course an incompetent implementation of either
type can manage to be immeasurably bad. But some algorithms are
inherently faster than others; and binary representations have a
larger choice of algorithms.
2: Where is the data kept?
Often the biggest speed factor of all is what data is in RAM vs. disk
vs. over the net. RAM is about 10,000 times faster than disk (disk
seeks are very expensive). Binary formats, historically, are intended to
overcome this obstacle.
Jeff Vogel and I wrote the first binary SGML implementation that
could handle large documents, in early 1990. Our staff tweaked its
parser for XML later. Because we didn't have to touch the binary
representation at all, it would not be stretching much to say we had
a binary XML representation up and running in 1990. For the usual
operations required to search and render documents, nothing I've seen
yet has been faster for big documents. But it is uncommon these days
that single XML documents are too big to be kept in RAM.
The company was Electronic Book Technologies, the product was
DynaText, and it was mainly used for *really* big documents, like
F-16 manuals that on paper would outweigh the plane. Typical size for
a *single* document was 10-250MB. You could open a document that big,
go to anyplace determined by an XPath-like expression, render, and
have the text on the screen in about 1 second. If you want an
interesting contrast, make yourself a 1MB HTML file, open it in a
browser, scroll to the bottom, and then resize horizontally.
On the other hand, it was purely a delivery system, and you couldn't
update in place, although the binary format we used theoretically could.
There are at least 11 patents on it, so anyone can go see one way to
design a binary XML format that fast (though some of the cooler
tweaks were post-patent). One expects that any committees involved
would do that. Perhaps after 15 years they could do something
substantially better -- but we'll see.
3: What does "lossless" mean?
A few other recent postings have mentioned this issue. I think most
people would consider a format "lossless" if you could export from it
back into XML syntax, and when you parsed the resulting XML you got
the same DOM as for the original document. If that's enough, it's not
hard to make a lossless binary format (and mine was lossless, except
I think it discarded comments and PIs). HOWEVER, this is not
completely lossless. You still lose (among other things):
* the entity structure
* being able to get a matching DSIG
* all sorts of really ugly whitespace normalization details
(including within tags)
* single- versus double-quoting of attributes
* namespace prefix usage
* order of attributes
* <br /> versus <br></br>
So, until you define "lossless", there's no point in comparing
whether two products are lossless or not.
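A quick way to see the distinction: the two byte streams below differ
in quoting, attribute order, and <br /> versus <br></br>, yet parse
to equivalent trees. A format that only guarantees the first equality
silently loses everything on the list above. (A small Python check;
"shape" is a made-up helper name.)

  import xml.etree.ElementTree as ET

  a = b'<p class="x" id="1"><br /></p>'
  b = b"<p id='1' class='x'><br></br></p>"

  def shape(el):
      # Tag, attributes (as a dict, so order is ignored), text, children.
      return (el.tag, dict(el.attrib), el.text, [shape(c) for c in el])

  print(shape(ET.fromstring(a)) == shape(ET.fromstring(b)))   # True
  print(a == b)                                               # False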
Transportability also poses problems. For example, if you need to
move data from one system to another, you have to worry about any binary
numbers you store -- some systems store the high-order bytes first,
some store them last. We insisted on making our binaries readable
across platforms, and that involves a lot of byte-swapping overhead
that XML parsers never have to mess with.
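The byte-order issue in miniature (Python's struct here just stands
in for whatever record format a binary encoding would use):

  import struct

  offset = 0x00012345
  big    = struct.pack(">I", offset)    # bytes 00 01 23 45
  little = struct.pack("<I", offset)    # bytes 45 23 01 00

  # Reading a file written on a machine with the other byte order means
  # swapping before you can trust the number; plain text never cares.
  assert struct.unpack(">I", big)[0] == struct.unpack("<I", little)[0]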
4: Hybrid solutions
If you only need to optimize certain operations, you can do it within
XML: Make a pass over the file and add attributes as needed. In the
right setting, this could be really fast (though it's harder than it
looks):
<sec b:next-sibling-offset='99999' prev-sibling-offset='241'...>
Also, if you mainly need to optimize resolving IDREFs, just make a
separate index that says where they are, and leave the XML as is.
XLink works nicely for this.
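A sketch of that "separate index, leave the XML alone" idea, using
expat to record a byte offset for every element carrying an id
attribute (the attribute name and the index shape are assumptions,
not anyone's actual product):

  import xml.parsers.expat

  def build_id_index(xml_bytes):
      index = {}              # id value -> byte offset of the start tag
      parser = xml.parsers.expat.ParserCreate()

      def start(name, attrs):
          if "id" in attrs:   # assumes plain 'id' attributes
              index[attrs["id"]] = parser.CurrentByteIndex

      parser.StartElementHandler = start
      parser.Parse(xml_bytes, True)
      return index

  # Resolving an IDREF later is a dict lookup plus a seek into the
  # original, untouched file.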
What I'm saying overall is that the solution space is much wider than
it may appear, and the answers are more complex. Also, that it can
be, and has been, done successfully. But except for really huge
documents, I don't think it's usually worth the effort.
Steve