> As always it comes down to Use Cases, and
> - if readability by Humans is an important / prioritized requirement, then text-based
> ML wins.
> - If performance is important, then usually binary ML wins.
No, no, no. One cannot make generalizations about binary v. text like that: one can only compare specific formats and implementations, sending average instances of specific types of documents over specific protocols, with specific binary-handling disciplines.
It is completely possible to make inefficient binary formats (e.g. the Sun example I mentioned before, where the XML form was 10 to 100 times smaller), or binary formats with performance penalties. Conversely, text has many options:
- It is possible to provide indexes in XML documents (e.g. my previous posting), or to provide multipart documents pairing an XML document with a binary index for searching.
- It is possible to provide non-XML text formats with good performance characteristics (e.g. Steve DeRose's patented indexing method using fully qualified names), or my STAX short-tagging compression, which can give well over 50% reduction in file size (in suitable cases) for just a paragraph or so of extra, non-processor-taxing code inside an XML parser.
- And more efficient parsers are possible (especially for trusted data) if they assume well-formed documents (e.g. Tim Bray's comment on tokenizing GIs by whitespace rather than checking that each following character is a NAME character).
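To make the short-tagging idea concrete, here is a minimal sketch in the spirit of (but not the actual implementation of) the STAX compression mentioned above: in a well-formed document the name in an end-tag is redundant, so a compressor can emit an SGML-style empty end-tag `</>` and a decompressor can restore the name from a stack. All names here are illustrative.

```python
import re

# Split a document into tags and text runs (no CDATA/comment handling; a sketch).
TOKEN = re.compile(r'<[^>]+>|[^<]+')

def compress(xml: str) -> str:
    out = []
    for tok in TOKEN.findall(xml):
        if tok.startswith('</'):
            out.append('</>')                       # drop the redundant end-tag name
        else:
            out.append(tok)
    return ''.join(out)

def decompress(short: str) -> str:
    out, stack = [], []
    for tok in TOKEN.findall(short):
        if tok == '</>':
            out.append('</%s>' % stack.pop())       # restore the name from the stack
        elif tok.startswith('<') and not tok.startswith(('<?', '<!')):
            if not tok.endswith('/>'):
                stack.append(tok[1:-1].split()[0])  # remember the element name
            out.append(tok)
        else:
            out.append(tok)
    return ''.join(out)

doc = '<order><item><name>widget</name><qty>2</qty></item></order>'
assert decompress(compress(doc)) == doc             # lossless round trip
```

The saving depends entirely on how much of the document is end-tags, which is exactly the use-case sensitivity argued for above.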
It would be possible to provide some metadata (in the HTTP header? in a PI?) giving the element/attribute/string counts, to allow bulk allocation of objects as an array: if object allocation is one of the most expensive bottlenecks in some languages, this could allow more efficient implementations of XML DOMs. (A binary format might also benefit from this, but as XML becomes more efficiently implementable, the justification for binary formats weakens correspondingly. Actually, it is probably a case of XML getting its foot in the door for shipping trees around, with particular sectors later finding the optimal format such as ASN.1: XML as a bootstrapping/prototyping notation.)
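As a sketch of the PI idea (the PI name `node-counts` and its attributes are hypothetical, not any existing convention): a DOM builder could read allocation hints from the front of the document before parsing, and size its node storage in one go.

```python
import re

def read_count_hints(xml: str) -> dict:
    """Read a hypothetical <?node-counts ...?> PI from the start of a document."""
    m = re.match(r'\s*<\?node-counts\s+([^?]*)\?>', xml)
    if not m:
        return {}
    return {k: int(v) for k, v in re.findall(r'(\w+)="(\d+)"', m.group(1))}

doc = '<?node-counts elements="3" attributes="1"?><a x="1"><b/><c/></a>'
hints = read_count_hints(doc)
# A DOM builder could now preallocate, e.g.:
#   nodes = [None] * hints.get("elements", 0)
# instead of growing its node table incrementally during the parse.
```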
And binary encodings that require a schema may not give much compression advantage when the document has no schema. And, when considering schemas, also consider that SGML (which allows very sparse markup and much smaller file sizes) may, if efficiently implemented, be no more taxing (in terms of machine operations performed) than XML.
And there is another cat to let out of the bag: sparse, lazy DOMs (i.e. DOMs constructed lazily, as required, from a fragment server) may require far less processing than retrieving full documents, whether those documents are sent as XML or non-XML.
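A minimal sketch of such a lazy DOM node (the `fetch_children` callback stands in for a fragment server; all names are illustrative): children are fetched only on first access, so untouched subtrees cost nothing.

```python
class LazyNode:
    """A DOM node whose children are materialized on demand from a fragment server."""

    def __init__(self, name, fetch_children):
        self.name = name
        self._fetch = fetch_children
        self._children = None                  # not yet materialized

    @property
    def children(self):
        if self._children is None:             # first access: one server round-trip
            self._children = [LazyNode(n, self._fetch)
                              for n in self._fetch(self.name)]
        return self._children

calls = []
def fake_server(name):
    calls.append(name)                         # record round-trips for demonstration
    return {"root": ["a", "b"], "a": [], "b": []}[name]

root = LazyNode("root", fake_server)
assert calls == []                             # nothing fetched yet
assert [c.name for c in root.children] == ["a", "b"]
assert calls == ["root"]                       # only the fragment we touched
```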
Another example of inappropriate processing might be a SOAP router that creates a DOM or SAX stream for the whole document, when it might be better to parse only the header (as a mini-DOM, mini-SAX stream, or whatever) while treating the rest of the document as an opaque byte array or stream.
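A sketch of that routing shape (deliberately simplified: literal tag names, no namespace prefixes or envelope validation): parse just the header, and hand the remainder on as bytes.

```python
import xml.etree.ElementTree as ET

def route(message: bytes):
    """Parse only the SOAP-style header; pass the rest through untouched."""
    end = message.index(b'</Header>') + len(b'</Header>')
    header = ET.fromstring(message[message.index(b'<Header>'):end])
    rest = message[end:]                       # body stays an opaque byte slice
    return header, rest

msg = b'<Envelope><Header><To>inventory</To></Header><Body>...</Body></Envelope>'
header, rest = route(msg)
assert header.find('To').text == 'inventory'   # routing info, fully parsed
assert rest == b'<Body>...</Body></Envelope>'  # body never touched a parser
```

The router pays parsing cost proportional to the header, not the payload, whatever notation the payload is in.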
So Anders is right that the use-case is important, but the use-case is not merely readability, however excellent that constantly shows itself to be. Many of the supposed benefits of a binary format may have nothing to do with its binary nature, and may be just as doable in vanilla XML or in another text format.
I think Gavin's other point is closer to the mark: the higher the ratio of data to markup, the less processing difference there can be between any formats or notations (since any difference in the text is a matter of compression, and whether that happens string-by-string in a binary format or over the whole entity for XML is irrelevant). But the ratio of text to markup cannot, in general, be predicted from the average size of a document: for particular document types and instance sets it can be, of course.
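That ratio is easy to measure per document type; a rough sketch (treating everything in angle brackets as markup, which ignores attributes-as-data and entity references):

```python
import re

def data_ratio(xml: str) -> float:
    """Fraction of the document that is character data rather than markup."""
    markup = sum(len(m) for m in re.findall(r'<[^>]*>', xml))
    return (len(xml) - markup) / len(xml)

# A data-heavy document: almost nothing for an alternative notation to save.
assert data_ratio('<p>' + 'x' * 1000 + '</p>') > 0.99
# A markup-heavy document: most bytes are structure, so notation choice matters.
assert data_ratio('<item><n>1</n></item>') < 0.1
```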