   Re: [xml-dev] Binary XML == "spawn of the devil" ?


Elliotte Rusty Harold wrote:
> One of the goals of some of the developers pushing binary XML is to 
> speed up parsing, to provide some sort of preparsed format that is 
> quicker to parse than real XML. I am extremely skeptical that this can 
> be achieved in a platform-independent fashion. Possibly some of the 
> ideas for writing length codes into the data might help, though I doubt 
> they help that much, or are robust in the face of data that violates the 
> length codes.  Nonetheless this is at least plausible.

Speed gains vary greatly with the data; there's little doubt about that. A very 
loosely tagged document-oriented, er, document will gain much less than a 
markup-intensive one carrying typed data. However, in my experience there's 
always a gain. The question is always a) is it enough and b) do you need it.

Both (I believe) Cocoon and the Perl/XML projects have simplistic serialisations 
of SAX streams (CXML and XML::Filter::Cache). I haven't looked into either in a 
while, but I remember the latter providing speed-ups of around 3x, which is good, 
especially for something so simple. Its purpose is caching: if you can isolate 
cache conditions in a longish or costly SAX pipeline, as are common in Perl, 
then you can start from the middle of it, reading a faster, cached format.
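
To make the idea concrete, here is a minimal sketch of such a serialisation, 
in Python. It is not the CXML or XML::Filter::Cache format; the one-byte event 
tags, the 4-byte length codes, and the names Recorder and replay are all 
assumptions made up for illustration.

```python
import io
import struct
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl

# Hypothetical one-byte event tags (not taken from any existing format).
START, END, TEXT = 1, 2, 3

def _put(out, text):
    """Write a string as a 4-byte big-endian length code plus UTF-8 bytes."""
    data = text.encode("utf-8")
    out.write(struct.pack(">I", len(data)))
    out.write(data)

def _get(inp):
    """Read one length-prefixed UTF-8 string."""
    (size,) = struct.unpack(">I", inp.read(4))
    return inp.read(size).decode("utf-8")

class Recorder(ContentHandler):
    """SAX handler that records the events it receives into a binary stream."""
    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        self.out.write(bytes([START]))
        _put(self.out, name)
        self.out.write(struct.pack(">H", len(attrs)))
        for key in attrs.getNames():
            _put(self.out, key)
            _put(self.out, attrs.getValue(key))

    def endElement(self, name):
        self.out.write(bytes([END]))
        _put(self.out, name)

    def characters(self, content):
        self.out.write(bytes([TEXT]))
        _put(self.out, content)

def replay(data, handler):
    """Feed recorded events to a SAX handler without re-parsing any XML."""
    inp = io.BytesIO(data)
    handler.startDocument()
    while True:
        tag = inp.read(1)
        if not tag:
            break
        kind = tag[0]
        if kind == START:
            name = _get(inp)
            (count,) = struct.unpack(">H", inp.read(2))
            attrs = {}
            for _ in range(count):
                key = _get(inp)
                attrs[key] = _get(inp)
            handler.startElement(name, AttributesImpl(attrs))
        elif kind == END:
            handler.endElement(_get(inp))
        elif kind == TEXT:
            handler.characters(_get(inp))
        else:
            raise ValueError("unknown event tag %d" % kind)
    handler.endDocument()
```

Recording happens once, with something like 
xml.sax.parseString(doc, Recorder(buf)); after that, replay(buf.getvalue(), 
next_handler) restarts the pipeline from the cached point. Whether that is 
actually faster depends on the implementation and the data.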

The issues you raise with platform independence are also real, more on this below.

I'm unsure what you mean about "data that violates the length codes"? Surely if 
the format is not well-formed, it is... not well-formed :) I see little reason 
to be more lenient in a bInfoset than one would be with XML.
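
As a sketch of what I mean by not being lenient, assuming a made-up layout of 
4-byte big-endian length codes each followed by that many bytes (read_chunks 
and NotWellFormed are hypothetical names), a reader can simply refuse any 
stream whose length codes don't match the data:

```python
import struct

class NotWellFormed(ValueError):
    """Raised when the stream contradicts its own length codes."""

def read_chunks(data):
    """Split data into chunks, rejecting it outright if a length code lies."""
    chunks = []
    pos = 0
    while pos < len(data):
        if len(data) - pos < 4:
            raise NotWellFormed("truncated length code at offset %d" % pos)
        (size,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if size > len(data) - pos:
            raise NotWellFormed("length code %d overruns the data" % size)
        chunks.append(data[pos:pos + size])
        pos += size
    return chunks
```

This is the same stance an XML parser takes on a missing end tag: a fatal 
error, with no attempt at recovery.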

> However, this is not the primary preparsing of XML I've seen in existing 
> schemes. A much more common approach assigns types to the data and then 
> writes the data into the file as a binary value that can be directly 
> copied to memory.

In truth I have seen both types, in roughly equal measure. The tokenised 
preparsing approach is the simplest and I believe that in fact there are more 
formats using that approach. However since a number of them were one-offs they 
didn't get to have as much name-recognition. One of the issues with bInfosets is 
that a lot of work on them happens behind closed doors (even working in the 
field it takes long-range ears to hear about some of the projects ;), and given 
the persistence of data they may leak into the larger web (as occasionally it 
has already) at any random point in their existence. My hope with the workshop 
is that it will get people to work together to avoid lock-in à la Flash (an 
inferior format surfaces almost by accident while more promising ones are 
already around or in the works, and takes over the web).

> For example, an integer might be written as a 
> four-byte big-endian int. A floating point number might be written as an 
> eight-byte IEEE-754 double, and so forth. This might speed up things a 
> little in a few cases. However, it's really only going to help on those 
> platforms where the native types match the binary formats. On platforms 
> with varying native binary types, it might well be slower than 
> performing string conversions.

In my experience that sort of conversion is normally faster than converting from 
a string, especially for types with larger lexical spaces (eg scientific 
notation and so on). Even in a scenario where it is only as fast on a given 
platform, it will still be faster on others, so there is an overall gain.
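
For illustration, here are the two decoding paths side by side, using the 
four-byte big-endian int and eight-byte IEEE-754 double from your example. 
This is only a sketch (the function names are made up), and it makes no claim 
about which path is faster on any given platform; that has to be measured.

```python
import struct

def decode_binary(buf):
    """Decode a 4-byte big-endian int followed by an 8-byte IEEE-754 double."""
    return struct.unpack(">id", buf)

def decode_lexical(int_text, float_text):
    """Decode the same two values from their lexical (string) forms."""
    return int(int_text), float(float_text)

# With an explicit byte order, struct adds no padding: 4 + 8 = 12 bytes.
packed = struct.pack(">id", 42, 6.02e23)
```

The lexical path has to cope with the whole lexical space ("6.02e23", 
"602000000000000000000000.0", leading zeroes and so on), while the binary 
path is at worst a byte swap.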

Some formats take things like endianness into account (with something resembling 
a BOM to flag it), which allows optimised transmission across platforms that 
don't use network order, when you know who you're talking to. I'm still divided 
on whether this is the right way to do it, on whether the speed difference is 
worth it, and on whether it generally is a good idea, but the avenue of thought 
is certainly interesting and worth walking down.
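
A minimal sketch of such a flag, assuming a hypothetical one-byte marker in 
front of a run of 32-bit integers (the marker values and function names are 
invented for this example, not taken from any of the formats I mentioned):

```python
import struct
import sys

# Hypothetical one-byte byte-order flags.
BIG, LITTLE = b"\x00", b"\x01"

def encode_ints(values, little=None):
    """Write 32-bit ints in the sender's native order unless told otherwise."""
    if little is None:
        little = sys.byteorder == "little"
    prefix = "<" if little else ">"
    flag = LITTLE if little else BIG
    return flag + struct.pack("%s%di" % (prefix, len(values)), *values)

def decode_ints(data):
    """Read the flag, then decode the ints in whatever order it announces."""
    flag, body = data[:1], data[1:]
    if flag == BIG:
        prefix = ">"
    elif flag == LITTLE:
        prefix = "<"
    else:
        raise ValueError("missing or unknown byte-order flag")
    if len(body) % 4:
        raise ValueError("body is not a whole number of 32-bit ints")
    return list(struct.unpack("%s%di" % (prefix, len(body) // 4), body))
```

Two little-endian peers then never swap a byte, at the cost of every reader 
having to support both orders.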

> Unicode decoding is a related issue. It's been suggested that this is a 
> bottleneck in existing parsers, and that directly encoding Unicode 
> characters instead of UTF code points might help. However, since in a 
> binary format you're shipping around bytes, not characters, it's not 
> clear to me how this encoding would be any more efficient than existing 
> encodings such as UTF-8 and UTF-16. If you just want 32-bit characters 
> then use UTF-32. Possibly you could gain some speed by slamming bytes 
> into the native string or wstring type (UTF-16 for Java, possibly other 
> encodings for other languages.) However, as with numeric types this 
> would be very closely tied to the specific language. What worked well 
> for Java might not work well for C or Perl and vice versa.

Same issue, same solutions. Mostly, this is the UTF-8/UTF-16 divide. IIRC, for 
their internal representations Java and Xerces-C use UTF-16, while libxml and 
Perl use UTF-8. We noticed that once everything else has been optimised, 
transcoding is a very noticeable cost when it is needed. It doesn't slow things 
down nearly enough to make the overall gain disappear, but it's there. Again, 
some formats give one the choice, and again I'm still going back and forth on 
whether that's a good idea or not.
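
To show where that cost sits, here is a sketch of a reader that transcodes 
only when the encoding of the stream differs from the one the consumer wants 
internally. The function name and the idea of the stream declaring its 
encoding are assumptions for the example.

```python
def to_internal(data, stream_encoding, internal_encoding):
    """Return data in the consumer's internal encoding.

    When both encodings match, the bytes are handed over untouched; otherwise
    every character is decoded and re-encoded, which is the cost in question.
    """
    if stream_encoding == internal_encoding:
        return data
    return data.decode(stream_encoding).encode(internal_encoding)
```

A UTF-8 consumer reading a UTF-8 stream takes the first branch; a UTF-16 
consumer reading the same stream pays for the second.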

> I think this 
> could be fully implemented within the bounds of XML 1.0. I don't see why 
> a new serialization format would be necessary to remove this bottleneck 
> from the process.

Because it's not about removing just one bottleneck?

> In summary, I am very skeptical that any prepared format which accepts 
> schema-invalid documents is going to offer significant speedups across 
> different platforms and languages. I do not accept as an axiom that 
> binary formats are naturally faster to parse than text formats. Possibly 
> this can be proved by experiment, but I tend to doubt it.

Binary formats are not "naturally" faster. Even just by digging in this list's 
archives you'll find reports of slower binary formats. No surprise there: a 
format that wasn't designed with speed in mind to begin with may or may not be 
faster, depending on how it was done. The whole point of getting together is to 
share such experiences.

Robin Berjon <robin.berjon@expway.fr>
Research Engineer, Expway        http://expway.fr/
7FC0 6F5F D864 EFB8 08CE  8E74 58E6 D5DB 4889 2488



Copyright 2001 XML.org. This site is hosted by OASIS