- From: David Megginson <david@megginson.com>
- To: XML-Dev Mailing list <xml-dev@ic.ac.uk>
- Date: Sat, 27 Feb 1999 14:02:12 -0500 (EST)
Tom Harding writes:
> How? You would doubtless agree that mandating a specific encoding
> for all streams sidesteps one of the major benefits of XML.
> Introducing an encoding declaration mechanism into the transport
> protocol, as HTTP does, would duplicate the function of the XML
> processor.
Here's a short excerpt from the non-normative Appendix F of the XML
1.0 Recommendation:
The second possible case occurs when the XML entity is accompanied by
encoding information, as in some file systems and some network
protocols. When multiple sources of information are available, their
relative priority and the preferred method of handling conflict should
be specified as part of the higher-level protocol used to deliver
XML. Rules for the relative priority of the internal label and the
MIME-type label in an external header, for example, should be part of
the RFC document defining the text/xml and application/xml MIME
types. In the interests of interoperability, however, the following
rules are recommended.
- If an XML entity is in a file, the Byte-Order Mark and
  encoding-declaration PI are used (if present) to determine the
  character encoding. All other heuristics and sources of
  information are solely for error recovery.
- If an XML entity is delivered with a MIME type of text/xml, then
  the charset parameter on the MIME type determines the character
  encoding method; all other heuristics and sources of information
  are solely for error recovery.
- If an XML entity is delivered with a MIME type of application/xml,
  then the Byte-Order Mark and encoding-declaration PI are used (if
  present) to determine the character encoding. All other heuristics
  and sources of information are solely for error recovery.
These rules apply only in the absence of protocol-level documentation;
in particular, when the MIME types text/xml and application/xml are
defined, the recommendations of the relevant RFC will supersede these
rules.
If I were defining a streaming protocol for e-commerce, news,
financial markets, etc., I probably would mandate a single encoding
for all packets (UTF-8 or UTF-16), just to keep things simple. As you
can see in the above excerpt, the character-set discovery heuristics in
XML are intended for use only in the absence of protocol-specific
encoding information.
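As a rough illustration of the priority rules recommended in the excerpt, here is a minimal Java sketch. The class name EncodingChooser, the Delivery enum, and the choose method are all hypothetical names invented for this example; they are not part of the Recommendation or of any library. The sketch assumes that BOM and encoding-declaration detection has already been done elsewhere and is passed in as the "internal" charset, and it assumes (my choice, not the excerpt's) that a missing external label falls back to the internal one as error recovery.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingChooser {
    /** How the entity was delivered (hypothetical names, for illustration only). */
    enum Delivery { PROTOCOL_MANDATED, TEXT_XML, APPLICATION_XML, FILE }

    /**
     * Picks the character encoding for an XML entity.
     *
     * @param external charset supplied outside the entity (protocol rule or
     *                 MIME charset parameter), or null if none was given
     * @param internal charset found from the Byte-Order Mark or encoding
     *                 declaration, or null if neither was present
     */
    static Charset choose(Delivery delivery, Charset external, Charset internal) {
        switch (delivery) {
            case PROTOCOL_MANDATED:
            case TEXT_XML:
                // The external label wins; the internal one is only a fallback.
                if (external != null) {
                    return external;
                }
                break; // assumption: recover by using the internal label
            default:
                // FILE and APPLICATION_XML: the external label is ignored.
                break;
        }
        if (internal != null) {
            return internal;
        }
        // XML 1.0 treats an entity with no BOM and no declaration as UTF-8.
        return StandardCharsets.UTF_8;
    }
}
```

With this sketch, a text/xml entity labelled charset=UTF-16 is read as UTF-16 even if its own declaration says otherwise, while the same entity delivered as application/xml is read according to its own declaration.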
<snip/>
> It's amazing how two people can see things so differently. I think
> it's supremely elegant that only the XML processor needs to look at
> data coming off the wire. It's also as efficient as it gets.
It is efficient only if you know for certain that you need to use a
single object model for all of the XML information that you're
receiving; otherwise, you'll end up building a generic object model
(like a DOM), then tearing it down to build an optimised
domain-specific one (such as a vector graphic or a
financial-transaction object), and that process would be painful.
> Of course the software architecture that handles the documents emitted
> must be modular and extensible, but the task of parsing is done.
Parsing is relatively easy (though it's wasteful to do it twice);
building an object model from the parsing is time- and
resource-consuming. For example, imagine that I have a Java class
like this:
  public class Purchase {
    public int seqno;
    public int customerId;
    public int vendorId;
    public int invoiceId;
    public float total;
  }
In XML, an instance of this information might look like this:
  <purchase xmlns="http://www.ecommerce.net/ns/ec/">
    <seqno>12345678</seqno>
    <customer-id>87654321</customer-id>
    <vendor-id>18273645</vendor-id>
    <invoice-id>81726354</invoice-id>
    <total>92674.12</total>
  </purchase>
Based on my (limited) understanding of the Java VM, the Java version
of a Purchase object will require 24 bytes of storage; I'd guess
that even a heavily-optimised generic DOM implementation would require
at least 5-10 times as much storage (I'll welcome corrections from any
DOM implementors on this list).
In other words, if I go straight from the XML to my own object model,
I can store 100,000 purchases in 2,400,000 bytes of storage; if I go
from XML to a generic DOM object model, I will require between
12,000,000 and 24,000,000 (or more) bytes to store the same
information, and then I will *still* have to build my own object model
afterwards.
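To make "going straight from the XML to my own object model" concrete, here is a minimal sketch of an event-based handler that fills in Purchase objects as the parser reports elements, without building a generic tree first. It assumes the namespace-aware SAX API bundled with current JDKs (org.xml.sax and javax.xml.parsers); the PurchaseHandler class and its parse helper are hypothetical names made up for this example, and the Purchase class simply mirrors the fields shown above. It does no validation and ignores the namespace URI for brevity.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Mirrors the Purchase class from the message above. */
class Purchase {
    public int seqno;
    public int customerId;
    public int vendorId;
    public int invoiceId;
    public float total;
}

/** Hypothetical handler: builds Purchase objects directly from SAX events. */
class PurchaseHandler extends DefaultHandler {
    final List<Purchase> purchases = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Purchase current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // discard whitespace collected between elements
        if ("purchase".equals(localName)) {
            current = new Purchase();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called more than once per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return; // not inside a <purchase> element
        }
        String value = text.toString().trim();
        switch (localName) {
            case "seqno":       current.seqno = Integer.parseInt(value); break;
            case "customer-id": current.customerId = Integer.parseInt(value); break;
            case "vendor-id":   current.vendorId = Integer.parseInt(value); break;
            case "invoice-id":  current.invoiceId = Integer.parseInt(value); break;
            case "total":       current.total = Float.parseFloat(value); break;
            case "purchase":    purchases.add(current); current = null; break;
            default:            break; // ignore unknown elements
        }
    }

    /** Parses a UTF-8 XML string and returns the purchases found in it. */
    static List<Purchase> parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            SAXParser parser = factory.newSAXParser();
            PurchaseHandler handler = new PurchaseHandler();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.purchases;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse purchase XML", e);
        }
    }
}
```

Calling PurchaseHandler.parse on the sample document above should yield a single Purchase with its five fields set; the only per-document state kept is one StringBuilder and the object being filled in.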
All the best,
David
--
David Megginson david@megginson.com
http://www.megginson.com/