Re: "Binary XML" proposals

No, no, no. One cannot generalize about binary v. text like that: one can only compare specific formats and implementations sending average instances of specific types of documents over specific protocols with specific binary-handling disciplines.

It is completely possible to make inefficient binary formats (e.g. the Sun example I mentioned before, where the XML form was 10 to 100 times smaller), or ones with performance penalties. It is completely possible to provide indexes in XML documents (e.g. my previous posting). It is possible to provide multipart documents with an XML document and a binary index for searching. It is possible to provide non-XML text formats that have nice performance characteristics (e.g. Steve DeRose's patented indexing method using fully qualified names, or my STAX short-tagging compression, which can give well over 50% reduction in file size in suitable cases for just a paragraph of extra lines of non-processor-taxing code inside an XML parser). And there are more efficient parsers possible (especially for trusted data) if they assume well-formed (WF) documents (e.g. Tim Bray's comment on implementing GI tokenizing based on whitespace rather than checking that the next character is not a NAME character).
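To make the last point concrete, here is a minimal sketch in Python of GI tokenizing that trusts the input to be well-formed. It is my reading of the idea, not Tim Bray's code: the function name `read_gi_trusting` is made up for illustration, and I assume the tokenizer may also stop at `/` and `>` as well as whitespace.

```python
from typing import Tuple

# Characters that can follow a generic identifier in a well-formed start-tag.
GI_END = frozenset(" \t\r\n/>")

def read_gi_trusting(doc: str, pos: int) -> Tuple[str, int]:
    """Return (generic identifier, index just past it) for a tag opening at doc[pos].

    Assumes doc[pos] == '<' and that the document is well-formed, so the name
    is simply everything up to the first delimiter. No character is checked
    against the NAME production.
    """
    start = pos + 1
    end = start
    while doc[end] not in GI_END:
        end += 1
    return doc[start:end], end

gi, next_pos = read_gi_trusting('<para id="x">text</para>', 0)
```

A checking parser would test every character against the NAME character classes; the trusting version does one small set lookup per character, and gives nonsense (or an IndexError) on data that is not well-formed, which is why it only suits trusted data.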

It would be possible to provide some metadata (in the HTTP header? in a PI?) giving the element/attribute/string counts, to allow bulk allocation of objects as an array: if object allocation is one of the most expensive bottlenecks in some languages, this could allow more efficient implementations of XML DOMs. (The binary format might also benefit from this, but as XML gets more efficiently implementable, the justification for binary formats weakens correspondingly. Actually, it is probably a case of XML getting the foot in the door for shipping trees around, and then particular sectors finding the optimal format, such as ASN.1, later: XML as a bootstrapping/prototyping notation.)
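A rough sketch of the counts-in-a-PI idea, in Python. The `node-counts` PI and its `elements` pseudo-attribute are hypothetical, invented here for illustration; nothing like them is standardized. Python lists only stand in for the real saving, which would come in languages where allocating one object per node is expensive.

```python
import re
import xml.parsers.expat

def parse_with_counts(xml_text):
    """Collect element names, sizing the storage up front when a hypothetical
    <?node-counts elements="N"?> PI appears before the root element."""
    state = {"names": [], "used": 0, "preallocated": False}

    def on_pi(target, data):
        if target == "node-counts":
            match = re.search(r'elements="(\d+)"', data)
            if match:
                # One bulk allocation instead of growing node by node.
                state["names"] = [None] * int(match.group(1))
                state["preallocated"] = True

    def on_start(name, attrs):
        if state["used"] < len(state["names"]):
            state["names"][state["used"]] = name
        else:
            # No hint, or the hint was too small: fall back to growing.
            state["names"].append(name)
        state["used"] += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_pi
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    # Drop unused slots in case the hint overstated the count.
    return state["names"][:state["used"]], state["preallocated"]
```

The counts are only a hint: the sketch still works when the PI is missing or wrong, so a receiver never has to trust it.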

And binary encodings that require a schema may not give much compression advantage if the document has no schema. And, when considering schemas, also consider that SGML (which can achieve very sparse markup and much smaller file sizes) may, if efficiently implemented, be only as taxing (in terms of machine operations performed) as XML.

And there is also the other cat in the bag: sparse, lazy DOMs (i.e. DOMs constructed lazily as required from a fragment server) may require far less processing than retrieving full documents whether those documents are sent as XML or non-XML.
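A minimal sketch of such a lazy DOM in Python, under my own assumptions: the `LazyNode` class, the `fetch` callable and the `FRAGMENTS` table standing in for a fragment server are all hypothetical, and a real server would of course be a network call.

```python
class LazyNode:
    """DOM-like node that asks the fragment server for its content on first use."""

    def __init__(self, node_id, fetch):
        self.node_id = node_id
        self._fetch = fetch      # hypothetical callable: node_id -> (name, child_ids)
        self._fragment = None    # nothing retrieved yet

    def _load(self):
        if self._fragment is None:
            name, child_ids = self._fetch(self.node_id)
            children = [LazyNode(c, self._fetch) for c in child_ids]
            self._fragment = (name, children)
        return self._fragment

    @property
    def name(self):
        return self._load()[0]

    @property
    def children(self):
        return self._load()[1]


# Stand-in for a fragment server: node id -> (element name, child ids).
FRAGMENTS = {
    0: ("book", [1, 2, 3]),
    1: ("title", []),
    2: ("chapter", []),
    3: ("chapter", []),
}
fetched = []

def fetch(node_id):
    fetched.append(node_id)   # record which fragments were actually retrieved
    return FRAGMENTS[node_id]

root = LazyNode(0, fetch)
first_child_name = root.children[0].name   # touches only nodes 0 and 1
```

Reading the title retrieves two fragments out of four; the chapters are never fetched or parsed, whatever notation they would have been shipped in.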

Another example of inappropriate processing might be a SOAP router that creates a DOM or SAX stream for the whole document, when it might be better to merely read the header (mini-DOM or mini-SAX stream or however) while treating the rest of the document as a byte array or stream.
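Here is a sketch of the header-only approach in Python, using the standard library's `XMLPullParser`. The function name `peek_header` and the chunk size are my own choices, and I assume the SOAP 1.1 envelope namespace. Parsing stops at the chunk in which the header closes, so a little of the body may still pass through the parser; everything after that chunk is only ever handled as bytes.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 namespace
HEADER_TAG = "{%s}Header" % SOAP_ENV

def peek_header(stream, chunk_size=4096):
    """Parse a binary stream only until the SOAP Header closes.

    Returns (header_element_or_None, iterator_over_all_message_bytes).
    The iterator replays the chunks already read, then passes the rest of
    the stream through without parsing it.
    """
    parser = ET.XMLPullParser(events=("end",))
    seen = []
    header = None
    while header is None:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        seen.append(chunk)
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == HEADER_TAG:
                header = elem
                break

    def passthrough():
        yield from seen
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                return
            yield chunk

    return header, passthrough()
```

A router would inspect the returned header element to pick a destination, then write the byte iterator straight to the outgoing connection. The cost of the routing decision then depends on the size of the header, not of the whole message.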

So Anders is right that the use-case is important, but the use-case is not merely readability, however excellent that constantly shows itself to be. A lot of the supposed benefits of a binary format may have nothing to do with the binary nature itself, and may be just as doable in vanilla XML or in a text format.

I think Gavin's other point is closer to the mark: the higher the ratio of data to markup, the less processing difference there can be between any formats or notations (since any difference in the text is a matter of compression, and whether that is done string-by-string in a binary format or on the whole entity for XML is irrelevant). But the ratio of text to markup cannot be predicted from the average size of a document, in general: for particular document types and instance sets it can be, of course.
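To illustrate why size alone says nothing about the ratio, here is a deliberately naive Python sketch that measures the fraction of a document taken up by tags. The function name `markup_fraction` is mine, and the scan ignores comments, CDATA sections and `>` inside attribute values, so treat it as a rough gauge only.

```python
def markup_fraction(doc: str) -> float:
    """Fraction of characters that fall inside <...> tags (naive scan)."""
    markup = 0
    in_tag = False
    for ch in doc:
        if ch == "<":
            in_tag = True
        if in_tag:
            markup += 1
        if ch == ">":
            in_tag = False
    return markup / len(doc) if doc else 0.0

# Two documents of identical length with very different ratios.
data_heavy = "<p>" + "x" * 93 + "</p>"
markup_heavy = "<r>" + "<e/>" * 23 + "x" + "</r>"
```

Both sample documents are 100 characters long, yet one is almost all data and the other almost all markup, so a format that only shrinks markup would help the second and do next to nothing for the first.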