Elliotte Rusty Harold wrote:
> One of the goals of some of the developers pushing binary XML is to
> speed up parsing, to provide some sort of preparsed format that is
> quicker to parse than real XML. I am extremely skeptical that this can
> be achieved in a platform-independent fashion. Possibly some of the
> ideas for writing length codes into the data might help, though I doubt
> they help that much, or are robust in the face of data that violates the
> length codes. Nonetheless this is at least plausible.
Speed gains vary greatly with the data; there's little doubt about that. A
very loosely tagged document-oriented, er, document will gain much less than a
markup-intensive one carrying typed data. However, in my experience there's
always a gain. The question is always a) is it enough and b) do you need it.
Both (I believe) Cocoon and the Perl/XML projects have simplistic serialisations
of SAX streams (CXML and XML::Filter::Cache). I haven't looked into either in a
while, but I remember the latter providing speed-ups circa 3x, which is good,
especially for something simple (its purpose is that if you can isolate cache
conditions in a longish or costly SAX pipeline as are common in Perl, then you
can start from the middle of it and get a faster, cached format).
The issues you raise with platform independence are also real, more on this below.
I'm unsure what you mean about "data that violates the length codes"? Surely if
the format is not well-formed, it is... not well-formed :) I see little reason
to be more lenient in a bInfoset than one would be with XML.
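
To make the length-code point concrete, here is a minimal sketch of a tokenised
event stream with length prefixes. The event codes, the record layout and the
function names are all assumptions made up for illustration; this is not the
CXML or XML::Filter::Cache format. The reader treats a length code that overruns
the data as a fatal error, much as an XML parser treats a well-formedness error.

```python
import struct

# Hypothetical event codes for a toy SAX-like stream.
START, END, TEXT = 1, 2, 3

def write_events(events):
    """Serialise (kind, string) pairs as: 1-byte kind, 4-byte length, UTF-8 bytes."""
    out = bytearray()
    for kind, value in events:
        data = value.encode("utf-8")
        out += struct.pack(">BI", kind, len(data))
        out += data
    return bytes(out)

def read_events(buf):
    """Read the stream back, rejecting any length code that violates the data."""
    pos = 0
    events = []
    while pos < len(buf):
        if pos + 5 > len(buf):
            raise ValueError("truncated event header")
        kind, length = struct.unpack_from(">BI", buf, pos)
        pos += 5
        if pos + length > len(buf):
            raise ValueError("length code exceeds remaining data")
        events.append((kind, buf[pos:pos + length].decode("utf-8")))
        pos += length
    return events
```

The reader never scans for delimiters; it jumps by the declared length, which is
where this kind of format would be expected to save time.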
> However, this is not the primary preparsing of XML I've seen in existing
> schemes. A much more common approach assigns types to the data and then
> writes the data into the file as a binary value that can be directly
> copied to memory.
In truth I have seen both types, in roughly equal measure. The tokenised
preparsing approach is the simplest and I believe that in fact there are more
formats using that approach. However since a number of them were one-offs they
didn't get to have as much name-recognition. One of the issues with bInfosets is
that a lot of work on them happens behind closed doors (even working in the
field it takes long-range ears to hear about some of the projects ;), and given
the persistence of data they may leak into the larger web (as has occasionally
happened already) at any random point in their existence. My hope with the workshop
is that it will get people to work together to avoid lock-in à la Flash (an
inferior format surfaces almost by accident while more promising ones are
already around or in the works, and takes over the web).
> For example, an integer might be written as a
> four-byte big-endian int. A floating point number might be written as an
> eight-byte IEEE-754 double, and so forth. This might speed up things a
> little in a few cases. However, it's really only going to help on those
> platforms where the native types match the binary formats. On platforms
> with varying native binary types, it might well be slower than
> performing string conversions.
In my experience that sort of conversion is normally faster than converting from
a string, especially for types with larger lexical spaces (e.g. scientific
notation and so on). Even imagining a scenario in which it is only as fast on
some platforms, it will still be faster on others, so an overall gain is provided.
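
As a rough sketch of the two conversions being compared, the following decodes
the same pair of values once from the binary layout described in the quoted text
(a four-byte big-endian int and an eight-byte IEEE-754 double) and once from a
lexical form. The function names and the space-separated text layout are
assumptions for illustration only, and no timing is claimed here.

```python
import struct

def decode_binary(buf):
    """Unpack a 4-byte big-endian int followed by an 8-byte IEEE-754 double."""
    return struct.unpack(">id", buf)

def decode_text(s):
    """Parse the same two values from a space-separated lexical form."""
    int_part, float_part = s.split()
    return int(int_part), float(float_part)
```

The binary path is a fixed-size copy plus an optional byte swap, while the text
path has to handle every spelling the lexical space allows ("6.02e23",
"602000000000000000000000", and so on).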
Some formats take things like endianness into account (with something resembling
a BOM to flag it), which allows optimised transmission across platforms that
don't use network order, when you know who you're talking to. I'm still divided
on whether this is the right way to do it, on whether the speed difference is worth
it, and on whether it generally is a good idea, but the avenue of thought is
certainly interesting and worth walking down.
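
A minimal sketch of such a byte-order flag, assuming a hypothetical one-byte
header (0 for big-endian, 1 for little-endian) followed by four-byte ints. The
flag values and function names are invented for the example and do not come
from any particular format.

```python
import struct
import sys

# Hypothetical flag values for the leading byte-order marker.
BIG, LITTLE = 0, 1

def write_ints(values, byteorder=sys.byteorder):
    """Write a flag byte, then the ints in the chosen (by default native) order."""
    flag, prefix = (LITTLE, "<") if byteorder == "little" else (BIG, ">")
    return struct.pack("B", flag) + struct.pack(f"{prefix}{len(values)}i", *values)

def read_ints(buf):
    """Read the flag byte and decode the remaining ints in the flagged order."""
    prefix = "<" if buf[0] == LITTLE else ">"
    count = (len(buf) - 1) // 4
    return list(struct.unpack_from(f"{prefix}{count}i", buf, 1))
```

Two little-endian peers would then exchange data without either side swapping
bytes, at the price of every reader having to support both orders.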
> Unicode decoding is a related issue. It's been suggested that this is a
> bottleneck in existing parsers, and that directly encoding Unicode
> characters instead of UTF code points might help. However, since in a
> binary format you're shipping around bytes, not characters, it's not
> clear to me how this encoding would be any more efficient than existing
> encodings such as UTF-8 and UTF-16. If you just want 32-bit characters
> then use UTF-32. Possibly you could gain some speed by slamming bytes
> into the native string or wstring type (UTF-16 for Java, possibly other
> encodings for other languages.) However, as with numeric types this
> would be very closely tied to the specific language. What worked well
> for Java might not work well for C or Perl and vice versa.
Same issue, same solutions. Mostly, this is the UTF-8/UTF-16 divide. IIRC, for
their internal representations Java and Xerces-C use UTF-16, libxml and Perl use
UTF-8. We noticed that once everything has been optimised, transcoding was a
very noticeable cost when it was needed. It doesn't at all slow things down
enough to make the overall gain disappear, but it's there. Again, some formats
give one the choice, and again I'm still going back and forth on whether that's a
good idea or not.
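
To illustrate where the transcoding cost comes from, here is a small sketch
assuming a format that labels the encoding of its string payloads. The function
name and its parameters are hypothetical. When the payload encoding matches the
consumer's internal representation the bytes are handed over untouched;
otherwise a full decode and re-encode is needed.

```python
def to_internal(payload, payload_encoding, internal_encoding):
    """Return payload bytes in the consumer's internal encoding."""
    if payload_encoding == internal_encoding:
        # Same encoding on both sides: no transcoding work at all.
        return payload
    # Mismatch (e.g. UTF-8 on the wire, UTF-16 internally): transcode.
    return payload.decode(payload_encoding).encode(internal_encoding)
```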
> I think this
> could be fully implemented within the bounds of XML 1.0. I don't see why
> a new serialization format would be necessary to remove this bottleneck
> from the process.
Because it's not about removing just one bottleneck?
> In summary, I am very skeptical that any prepared format which accepts
> schema-invalid documents is going to offer significant speedups across
> different platforms and languages. I do not accept as an axiom that
> binary formats are naturally faster to parse than text formats. Possibly
> this can be proved by experiment, but I tend to doubt it.
Binary formats are not "naturally" faster. Even just by digging in this list's
archives you'll find experiences showing slower binary formats. No surprise
there: a format that wasn't designed with speed in mind to begin with may or may
not be faster, depending on how it was done. The whole point of getting together
is to share such experiences.
Robin Berjon <email@example.com>
Research Engineer, Expway http://expway.fr/
7FC0 6F5F D864 EFB8 08CE 8E74 58E6 D5DB 4889 2488