OASIS Mailing List Archives

Re: Binary XML - summary of discussion to date

On Sat, Apr 14, 2001 at 08:35:45PM +0100, Al Snell wrote:
>On Sun, 15 Apr 2001, Danny Ayers wrote:
>> I could pick pretty much any target platform, pretty much any programming
>> language, pretty much any application I could imagine (that needed data
>> transfer and/or persistence) and use off-the-peg components for XML support.
>XDR runtime libraries and development kits are probably more widespread
>than XML runtimes, since an XDR toolkit of some kind comes with every OS
>that can do NFS (certainly every Unix I've encountered), but because

Not a terribly wide base of examples, actually.  Outside of the various
unixen and unix-alike OSen, NFS is very uncommon, and is often
available only as a high-cost package.

>hardly anyone knows this, very few programming languages have XDR or
>ONC-RPC bindings :-(
>Shame, really. If they did, then interoperable data transfer would be
>considered a non-problem long ago...

Bloody *unlikely*.

XDR, which you've extolled as an example, is a disaster for high-speed
handling, like BER (BER actually suffers from ambiguity of expression
as well).

A little over five years ago, I worked for a small contract programming
firm, and one of the services that we sold was protocol decodes for
network analysis hardware (to three competing manufacturers, no less;
the company wasn't at all embarrassed by sleaze).

Typically, for network analysis hardware, you've got two requirements:
grab *everything*, display *only* what the user says he cares about.

"Binary" protocols, btw, do not exist, as such.  TCP, IP, UDP, ARP (and
IPX, DDP, DRP, even ethernet frames and most other such things) have
headers that are record oriented.  It isn't that it's faster to read a
network-byte order sixteen-bit integer on *every* possible
platform--it's that IP always puts the length field in *exactly* the
same offset from the start of the frame, so even if it's encoded as
BCD, or as a four-character text field, I can get it just as fast.
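
A minimal sketch of the fixed-offset point, in Python.  The function
names and the 14-byte Ethernet II header (no VLAN tag) are assumptions
made for illustration; the only protocol fact relied on is that the
IPv4 total-length field sits in bytes 2-3 of the IP header.

```python
import struct

ETHERNET_HEADER_LEN = 14      # assumption: untagged Ethernet II framing
IP_TOTAL_LENGTH_OFFSET = 2    # bytes 2-3 of the IPv4 header


def ip_total_length(frame: bytes) -> int:
    """Read the IPv4 total-length field straight from its fixed offset."""
    offset = ETHERNET_HEADER_LEN + IP_TOTAL_LENGTH_OFFSET
    (length,) = struct.unpack_from("!H", frame, offset)  # network byte order
    return length


def text_field_length(record: bytes, offset: int) -> int:
    """Hypothetical record format: length stored as four ASCII digits.

    The encoding differs, but the lookup is still a single slice at a
    known offset; nothing in front of the field has to be parsed.
    """
    return int(record[offset:offset + 4].decode("ascii"))
```

Either way the cost is one indexed read, whatever the encoding of the
field itself.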

If it's XDR, I don't know where it is, and in order to find it, I have
to parse every byte that comes in front of it.  The generalization
about the allegedly self-describing binary formats is that they're a
pain to decode in a hurry.
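
To make that concrete, here is a rough Python sketch that pulls the
Nth string out of a buffer of XDR-encoded strings.  It assumes the
standard XDR string layout (a four-byte big-endian length, then the
data padded to a multiple of four bytes); the helper names are
invented for the example.

```python
import struct


def xdr_pack_string(data: bytes) -> bytes:
    """Encode one XDR string: 4-byte length, data, zero padding to 4 bytes."""
    padding = (-len(data)) % 4
    return struct.pack("!I", len(data)) + data + b"\x00" * padding


def xdr_skip_string(buf: bytes, pos: int) -> int:
    """Return the position just past the XDR string starting at pos."""
    (length,) = struct.unpack_from("!I", buf, pos)
    padded = (length + 3) & ~3
    return pos + 4 + padded


def xdr_nth_string(buf: bytes, index: int) -> bytes:
    """Fetch string number `index`, walking every string in front of it."""
    pos = 0
    for _ in range(index):
        # No fixed offset exists: each earlier length must be read in turn.
        pos = xdr_skip_string(buf, pos)
    (length,) = struct.unpack_from("!I", buf, pos)
    return buf[pos + 4:pos + 4 + length]
```

The loop in `xdr_nth_string` is the whole complaint: the position of
the field you want depends on the contents of everything before it.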

XML is actually *easier* in this context (seek-in-a-hurry), because it
reserves all sorts of characters as special ... most notably, pointy
brackets.  If I were still doing this stuff, I'd consider XML to be my
darling baby, even easier than text (which is dead easy, no matter
what, 'cause regex libraries are easy to get, and the user doesn't
have to know that they're using regular expressions (well, they aren't;
we translated into re)).  If I had to do decodes for content filtering
(I can just imagine some of the customers of our customers going like
"Wull, if I *paid* fifty thousand dollars for this, I wanna see who's
looking at porNOGraphy"), text was my darling (it was already done,
actually, and we just sold 'em another acronym over the same code:
"Sure, we can support ESMTP!  No, it'll cost just as much as HTTP,
SMTP, and IMAP").  First time I saw BXXP, I just laughed out loud.

In this context, the "self-describing binary" formats *must* *allow*
every possible value to be representable.  For XDR and BER, that meant
that you started munching at the front, and branched when the
self-describing data described a branch.  No offsets, no cheap
searches.  No two-byte, four-byte, whatever integers, except more or
less by accident, and because everything could (in theory) keep growing
and growing and growing and growing, then you had to parse everything
until you found the particular bit you needed (even if you know that
there are four integers in front of the particular integer you want,
you have to parse all four, because they're variable length, and you
*could* get the wrong one by making ass u m ptions).  If you can use
records, like IP or TCP etc., use them.  Offsets are easy.  Otherwise,
if it's going to be variable length, it's going to hurt, probably.  But
if there are delimiters to look for (newline, typically but not always
as defined for the NVT--pointy brackets for XML), that makes the
variable length easier ('cause you can ignore anything that isn't a
delimiter, just as soon as you figure out whether the thing after the
delimiter is something you care about -- using XML, equipping a filter
with an XPath of the form: /message/header/sender/Sender, look for '<',
then keep going to the next if the first character after isn't m, until
your stack state says "got message, look for header" ... oh, fill in
the rest).
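
Filling in the rest, very roughly, in Python.  This is a sketch of the
delimiter-hunting idea, not a real XML parser: it assumes well-formed
input with no CDATA sections and no '>' inside attribute values or
comments, and the function name is made up for the example.

```python
from typing import List, Optional


def find_text(doc: bytes, path: List[bytes]) -> Optional[bytes]:
    """Return the text of the first element whose ancestry equals `path`."""
    stack: List[bytes] = []
    pos = 0
    while True:
        lt = doc.find(b"<", pos)          # jump straight to the next delimiter
        if lt < 0:
            return None
        gt = doc.find(b">", lt)
        if gt < 0:
            return None
        tag = doc[lt + 1:gt]
        pos = gt + 1
        if tag.startswith((b"?", b"!")):  # declarations and (naively) comments
            continue
        if tag.startswith(b"/"):          # end tag: leave the current element
            if stack:
                stack.pop()
            continue
        if tag.endswith(b"/"):            # empty element: nothing to descend into
            continue
        stack.append(tag.split(None, 1)[0])   # element name, attributes dropped
        if stack == path:
            end = doc.find(b"<", pos)
            return doc[pos:end] if end >= 0 else doc[pos:]
```

For example, with the path from above:

```python
doc = (b"<message><header><to>bob</to>"
       b"<sender><Sender>amy</Sender></sender>"
       b"</header><body>hi</body></message>")
find_text(doc, [b"message", b"header", b"sender", b"Sender"])
```

Everything between delimiters is skipped by `find` without being
interpreted, which is the cheap search that the length-prefixed
formats don't offer.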

In short, I'd be very surprised if a variable-length binary encoding
is any faster to parse (if all you want is a subset of information)
than XML; it's certainly very likely to be harder to read.

Writing decodes for SNMP (ASN.1/BER) and the RPC based suites (XDR,
that is) was hell.  It isn't that no one knows about them.  They aren't
as suitable to the general purpose of XML as XML is.  Even XML-for-data
has the advantage, when discussion goes from machine to machine, that
there is no real discussion of byte order (trivia question: do all
current network protocols in wide use (not just TCP/IP, mind) use
network byte order for transmission of multi-byte abstractions?).
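
A small Python illustration of the byte-order point; the value 1500
and the function names are arbitrary choices for the example.

```python
import struct


def decode_big_endian(raw: bytes) -> int:
    """Interpret two bytes in network (big-endian) order."""
    return struct.unpack("!H", raw)[0]


def decode_little_endian(raw: bytes) -> int:
    """Interpret the same two bytes in little-endian order."""
    return struct.unpack("<H", raw)[0]


def decode_text(raw: bytes) -> int:
    """Interpret an ASCII decimal field; byte order never comes up."""
    return int(raw.decode("ascii"))


wire = struct.pack("!H", 1500)           # the bytes 0x05 0xDC
as_sent = decode_big_endian(wire)        # 1500
misread = decode_little_endian(wire)     # 56325, same bytes, wrong order
as_text = decode_text(b"1500")           # 1500 on any machine
```

The two binary readings disagree unless both ends settle on an order
beforehand; the text field has no such question to settle.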

Amy! (who has now not smoked for eighteen hours, and seems to be
heading rapidly in the direction of verbosity).
Amelia A. Lewis          alicorn@mindspring.com          amyzing@talsever.com
"Oh, fuck!  You did it just like I told you to!"  (The manager's lament)