This has been interesting.
A few points that I think are important:
1: What operations do you want to do?
Whether binary formats "are faster" depends on this more than anything else.
If you want to "generate the DSIG digital signature" it's hard to
imagine that a binary format could ever be faster, because the DSIG
is generally defined on the text stream -- so to correctly calculate
it, you'd have to crank through your binary structure and re-create
the text stream on the fly. That costs more time than simply scanning
a text stream that's already there.
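To make that concrete, here is a minimal Python sketch, with a plain
SHA-256 digest standing in for the much more involved DSIG machinery;
the toy document and the tuple-based "binary structure" are invented
for illustration:

import hashlib

# Toy illustration: a digest over the exact text stream stands in for
# the (much more involved) XML DSIG computation.
xml_text = b"<doc><sec>Hello</sec></doc>"

# With the text already in hand, one pass over the bytes is all it takes.
digest_from_text = hashlib.sha256(xml_text).hexdigest()

# A binary store has to re-create that exact byte stream first -- element
# names, whitespace and all -- before it can hash anything.
def serialize(node):                     # stand-in for walking a binary structure
    name, children, text = node
    inner = text or "".join(serialize(c) for c in children)
    return "<%s>%s</%s>" % (name, inner, name)

binary_doc = ("doc", [("sec", [], "Hello")], None)
digest_from_binary = hashlib.sha256(serialize(binary_doc).encode()).hexdigest()

assert digest_from_text == digest_from_binary   # same answer, extra work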
On the other hand, if you want to "skip to the next section" a
halfway decent binary format will clean up. That's because it will
have a pointer and get there in one step, while the parser has to go
looking. Even if the looking is really quick, it's slower than not
looking at all.
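A minimal sketch of the difference, in Python; the offsets, tag names,
and toy document are invented for illustration:

import re

xml_text = "<doc><sec>" + "x" * 1000 + "</sec><sec>target</sec></doc>"

# Text side: the parser has to scan forward until the next <sec> turns up.
def next_sec_by_scanning(text, after):
    m = re.compile("<sec>").search(text, after)
    return m.start() if m else -1

# Binary side: the offset was computed once, when the format was built,
# so "skip to the next section" is a single table lookup.
next_sibling_offset = {5: 1016}          # first <sec> starts at 5, second at 1016

assert next_sec_by_scanning(xml_text, 10) == 1016
assert next_sibling_offset[5] == 1016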
If you want to examine the XPath "ancestor" axis, resolve IDREFs,
and so on, you'll be best off with a binary thing that knows all that
already because it figured it out when it was first generated. If you
want "preceding", all bets are off unless the implementor
specifically designed for it. If you want all occurrences of "the",
binary and text will be about the same, and you'd need a full-text
index to improve on either.
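Here is a rough Python sketch of what "knows all that already" means:
parent pointers and an ID table built once, up front, so the ancestor
axis and IDREF resolution become simple lookups. The little test
document is made up, and this is only the general idea, not any
particular product's design:

import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<doc><chap id="c1"><sec id="s1"><p>See <ref idref="c1"/></p></sec></chap></doc>')

parent, by_id = {}, {}                   # built once, when the binary form is made
for el in doc.iter():
    if el.get("id"):
        by_id[el.get("id")] = el
    for child in el:
        parent[child] = el

def ancestors(el):                       # the XPath ancestor axis, no searching
    while el in parent:
        el = parent[el]
        yield el

ref = doc.find(".//ref")
assert by_id[ref.get("idref")].tag == "chap"              # IDREF: one lookup
assert [a.tag for a in ancestors(ref)] == ["p", "sec", "chap", "doc"]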
This reminds me to mention that the claim someone made, that it's
impossible to say which is faster because implementations differ, is
oversimplified. Of course an incompetent implementation of either
type can manage to be immeasurably bad. But some algorithms are
inherently faster than others; and binary representations have a
larger choice of algorithms.
2: Where is the data kept?
Often the biggest speed factor of all is what data is in RAM vs. disk
vs. over the net. RAM is about 10,000 times faster than disk (disk
seeks are very expensive). Binary formats, historically, are intended
to overcome this obstacle.
Jeff Vogel and I wrote the first binary SGML implementation that
could handle large documents, in early 1990. Our staff tweaked its
parser for XML later. Because we didn't have to touch the binary
representation at all, it would not be stretching much to say we had
a binary XML representation up and running in 1990. For the usual
operations required to search and render documents, nothing I've seen
yet has been faster for big documents. But it is uncommon these days
that single XML documents are too big to be kept in RAM.
The company was Electronic Book Technologies, the product was
DynaText, and it was mainly used for *really* big documents, like
F-16 manuals that on paper would outweigh the plane. Typical size for
a *single* document was 10-250MB. You could open a document that big,
go to anyplace determined by an XPath-like expression, render, and
have the text on the screen in about 1 second. If you want an
interesting contrast, make yourself a 1MB HTML file, open it in a
browser, scroll to the bottom, and then resize horizontally.
On the other hand, it was purely a delivery system, and you couldn't
update in place, although the binary format it used theoretically
could have supported that.
There are at least 11 patents on it, so anyone can go see one way to
design a binary XML format that fast (though some of the cooler
tweaks were post-patent). One expects that any committees involved
would do that. Perhaps after 15 years they could do something
substantially better -- but we'll see.
3: What does "lossless" mean?
A few other recent postings have mentioned this issue. I think most
people would consider a format "lossless" if you could export from it
back into XML syntax, and when you parsed the resulting XML you got
the same DOM as for the original document. If that's enough, it's not
hard to make a lossless binary format (and mine was lossless, except
I think it discarded comments and PIs). HOWEVER, this is not
completely lossless. You still lose (among other things):
* the entity structure
* being able to get a matching DSIG
* all sorts of really ugly whitespace normalization details
(including within tags)
* single- versus double-quoting of attributes
* namespace prefix usage
* order of attributes
* <br /> versus <br></br>
So, until you define "lossless", there's no point in comparing
whether two products are lossless or not.
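Here is a small Python illustration of the distinction: two documents
that differ at the byte level in several of the ways just listed, yet
come out of the parser as the same thing. The tiny documents and the
urn:x namespace are made up:

import xml.etree.ElementTree as ET

a = '<p:doc xmlns:p="urn:x" a="1" b="2"><br/></p:doc>'
b = "<q:doc  xmlns:q='urn:x' b='2' a='1'><br></br></q:doc>"

ta, tb = ET.fromstring(a), ET.fromstring(b)
assert ta.tag == tb.tag == "{urn:x}doc"       # the p and q prefixes both vanish
assert ta.attrib == tb.attrib                 # attribute order and quoting too
assert [c.tag for c in ta] == [c.tag for c in tb] == ["br"]
# "Lossless" in the DOM sense -- yet a DSIG over the raw bytes of a and b
# will never match.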
Transportability also poses problems. For example, if you mean to
move from one system to another, you have to worry about any binary
numbers you store -- some systems store the high-order bytes first,
some store them last. We insisted on making our binaries readable
across platforms, and that involves a lot of byte-swapping overhead
that XML parsers never have to mess with.
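A minimal sketch of that concern, in Python: every stored number gets
written in one agreed byte order (big-endian here), no matter what the
writing machine prefers. The offset value is invented:

import struct, sys

offset = 123456                          # some byte offset stored in the binary file

portable = struct.pack(">I", offset)     # always high-order byte first
native   = struct.pack("=I", offset)     # whatever this CPU happens to use

assert struct.unpack(">I", portable)[0] == offset   # true on every platform
print(sys.byteorder, portable.hex(), native.hex())
# On a little-endian machine: little 0001e240 40e20100 -- hence the swapping.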
4: Hybrid solutions
If you only need to optimize certain operations, you can do it within
XML: Make a pass over the file and add attributes as needed. In the
right setting, this could be really fast (though it's harder than it
looks):
<sec b:next-sibling-offset='99999' prev-sibling-offset='241'...>
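The "harder than it looks" part is that the offsets describe the very
file the pass is writing, so adding the attributes moves everything.
One dodge -- a rough sketch only, not how any particular product does
it, with invented section content -- is to reserve fixed-width
placeholders and patch them afterwards:

sections = ["<p>one</p>", "<p>two two</p>", "<p>three</p>"]
WIDTH = 10                               # every offset zero-padded to 10 digits

# Pass 1: emit each <sec> with a placeholder and remember where it starts.
out, starts, pos = [], [], 0
for body in sections:
    starts.append(pos)
    chunk = "<sec next-sibling-offset='%s'>%s</sec>" % ("0" * WIDTH, body)
    out.append(chunk)
    pos += len(chunk)
text = "".join(out)

# Pass 2: patch in the real offsets; the width never changes, so nothing moves.
buf = bytearray(text.encode())
for i, start in enumerate(starts[:-1]):                 # last <sec> keeps zeros
    field = str(starts[i + 1]).zfill(WIDTH).encode()
    at = text.index("'", start) + 1                     # just past the opening quote
    buf[at:at + WIDTH] = field
print(buf.decode())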
Also, if you mainly need to optimize resolving IDREFs, just make a
separate index that says where they are, and leave the XML as is.
XLink works nicely for this.
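Roughly, something like this Python sketch; the id="..." regex is only
good enough for illustration, and the file and ID names are made up:

import re

def build_id_index(path):
    """One pass over the raw bytes: ID value -> byte offset. The XML is untouched."""
    index = {}
    with open(path, "rb") as f:
        data = f.read()
    for m in re.finditer(rb'\bid="([^"]+)"', data):
        index[m.group(1).decode()] = m.start()
    return index

# Resolving an IDREF is then a dictionary lookup plus one seek:
#   offsets = build_id_index("manual.xml")
#   f.seek(offsets["chapter-12"])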
What I'm saying overall is that the solution space is much wider than
it may appear, and the answers are more complex. Also, that it can
be, and has been, done successfully. But except for really huge
documents, I don't think it's usually worth the effort.
Steve
--
Luthien Consulting: Real solutions to hard information management problems
Specializing in XML, schema design, XSLT, and project design/review/repair
Steven J. DeRose, Ph.D., sderose@acm.org