
   Re: [xml-dev] Binary XML == "spawn of the devil" ?


Apologies for restarting this thread. I've just returned from my 
vacation, and I'm working my way through a lot of e-mail that built 
up. Having now read this entire thread, there's one issue I noticed 
that's been feinted at a couple of times, but nobody seems to have 
taken it head-on. So please allow me to do that now.

One of the goals of some of the developers pushing binary XML is to 
speed up parsing, to provide some sort of preparsed format that is 
quicker to parse than real XML. I am extremely skeptical that this 
can be achieved in a platform-independent fashion. Possibly some of 
the ideas for writing length codes into the data might help, though I 
doubt they help that much, or are robust in the face of data that 
violates the length codes.  Nonetheless this is at least plausible.
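
To make the length-code idea concrete, here is a minimal sketch in Java. It assumes a hypothetical record layout (not taken from any actual binary XML proposal): a four-byte big-endian length followed by that many bytes of UTF-8 text. The class and method names are invented for illustration. The point it shows is that a reader cannot simply trust the length code; it still has to check it against the bytes actually present.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical layout, for illustration only:
// [4-byte big-endian length N][N bytes of UTF-8 text]
final class LengthPrefixedReader {

    // Reads one length-prefixed text field, rejecting length codes
    // that do not match the data actually available.
    static String readField(ByteBuffer in) {
        if (in.remaining() < 4) {
            throw new IllegalArgumentException("truncated length code");
        }
        int length = in.getInt(); // ByteBuffer is big-endian by default
        if (length < 0 || length > in.remaining()) {
            throw new IllegalArgumentException(
                "length code " + length + " does not fit the remaining "
                + in.remaining() + " bytes");
        }
        byte[] bytes = new byte[length];
        in.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Writes a field in the same hypothetical layout.
    static byte[] writeField(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + bytes.length)
                .putInt(bytes.length)
                .put(bytes)
                .array();
    }
}
```

The length code lets the reader skip scanning for a delimiter, but the bounds check has to stay, so the saving is smaller than it first appears.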

However, this is not the primary preparsing of XML I've seen in 
existing schemes. A much more common approach assigns types to the 
data and then writes the data into the file as a binary value that 
can be directly copied to memory. For example, an integer might be 
written as a four-byte big-endian int. A floating point number might 
be written as an eight-byte IEEE-754 double, and so forth. This might 
speed up things a little in a few cases. However, it's really only 
going to help on those platforms where the native types match the 
binary formats. On platforms whose native types differ from the file 
format (a different byte order or word size, say), it might well be 
slower than performing string conversions.
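
As a rough illustration of the two approaches, the following Java sketch decodes the same pair of values (an int and a double) from a typed binary form and from a text form. The class name, the method names, and the twelve-byte layout are assumptions made up for this example. It makes no claim about which path is faster; that would have to be measured on each platform.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative only: names and layout are hypothetical.
final class TypedValueDecoding {

    // Binary path: a four-byte big-endian int followed by an
    // eight-byte IEEE-754 double.
    static double[] decodeBinary(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN);
        int i = buf.getInt();
        double d = buf.getDouble();
        return new double[] { i, d };
    }

    // Text path: the same two values as whitespace-separated
    // decimal strings, as they might appear in ordinary XML content.
    static double[] decodeText(byte[] data) {
        String[] parts = new String(data, StandardCharsets.US_ASCII)
                .trim().split("\\s+");
        return new double[] {
            Integer.parseInt(parts[0]),
            Double.parseDouble(parts[1])
        };
    }

    // Produces the binary form consumed by decodeBinary.
    static byte[] encodeBinary(int i, double d) {
        return ByteBuffer.allocate(12).order(ByteOrder.BIG_ENDIAN)
                .putInt(i).putDouble(d).array();
    }
}
```

On hardware where `ByteOrder.nativeOrder()` is little-endian, the binary path has to swap bytes for every value, which is one concrete way the supposed "copy directly to memory" advantage depends on the platform.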

Unicode decoding is a related issue. It's been suggested that this is 
a bottleneck in existing parsers, and that directly encoding Unicode 
characters instead of UTF code units might help. However, since in a 
binary format you're shipping around bytes, not characters, it's not 
clear to me how this encoding would be any more efficient than 
existing encodings such as UTF-8 and UTF-16. If you just want 32-bit 
characters then use UTF-32. Possibly you could gain some speed by 
slamming bytes into the native string or wstring type (UTF-16 for 
Java, possibly other encodings for other languages). However, as with 
numeric types, this would be very closely tied to the specific 
language. What worked well for Java might not work well for C or Perl 
and vice versa.
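
To illustrate that every one of these options is still a byte serialization, here is a small Java sketch that counts the bytes the same text needs in UTF-8, UTF-16, and UTF-32. The class and method names are invented for this example. The UTF-32 size is computed as four bytes per code point rather than by calling an encoder, since UTF-32 is not among the charsets every Java platform is required to provide.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: names are hypothetical.
final class EncodingSizes {

    // Bytes needed for the text in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed in UTF-16 (big-endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Bytes needed in UTF-32: always four per code point.
    static int utf32Bytes(String s) {
        return 4 * s.codePointCount(0, s.length());
    }
}
```

For the four-character string `"a\u00e9\u20ac\uD83D\uDE00"` (an ASCII letter, an accented letter, the euro sign, and one character outside the Basic Multilingual Plane) these methods return 10, 10, and 16 respectively. A fixed-width form trades decoding work for size; it does not stop being an encoding.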

Nonetheless it should be doable: a Java parser could work directly 
on UTF-16 code units without decoding them into characters. 
Verifying the well-formedness of surrogate 
pairs might be more expensive, but is rarely needed in practice. I 
think this could be fully implemented within the bounds of XML 1.0. I 
don't see why a new serialization format would be necessary to remove 
this bottleneck from the process.
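
A minimal sketch of what working on UTF-16 code units could look like in Java follows. It is not drawn from any existing parser; the class and method names are hypothetical. The scan compares raw `char` values against the markup delimiters and never assembles code points, which is safe because surrogate code units (0xD800 through 0xDFFF) can never equal `<` or `&`. Surrogate verification is kept as a separate pass that a caller may skip.

```java
// Illustrative only: names are hypothetical.
final class CodeUnitScanner {

    // Returns the index of the next '<' or '&' at or after start,
    // or -1 if there is none. Operates on code units only.
    static int nextMarkup(char[] text, int start) {
        for (int i = start; i < text.length; i++) {
            char c = text[i];
            if (c == '<' || c == '&') {
                return i;
            }
        }
        return -1;
    }

    // Optional check: every high surrogate must be followed by a
    // low surrogate, and no low surrogate may appear on its own.
    static boolean surrogatesWellFormed(char[] text) {
        for (int i = 0; i < text.length; i++) {
            char c = text[i];
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= text.length
                        || !Character.isLowSurrogate(text[i + 1])) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;
            }
        }
        return true;
    }
}
```

A real parser would also have to enforce the XML 1.0 character ranges and name rules; this sketch covers only delimiter scanning and surrogate pairing.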

In summary, I am very skeptical that any preparsed format which 
accepts schema-invalid documents is going to offer significant 
speedups across different platforms and languages. I do not accept as 
an axiom that binary formats are naturally faster to parse than text 
formats. Possibly this can be proved by experiment, but I tend to 
doubt it.
-- 

   Elliotte Rusty Harold
   elharo@metalab.unc.edu
   Processing XML with Java (Addison-Wesley, 2002)
   http://www.cafeconleche.org/books/xmljava
   http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA




 


Copyright 2001 XML.org. This site is hosted by OASIS