Mike Champion wrote:
>As best I know, the big win for truly binary XML
>serializations is in avoiding the overhead of the
>Unicode-encoded text to UCS-character translation.
>Does anyone take issue with the assertion that the
>external encoding-> Unicode text translation is
>generally a significant portion of XML parsing time?
Yes. Transcoding ASCII, ISO8859-1 or UTF-16 is just a cast;
translating UTF-8 is a tiny automaton, easily small enough to fit into
a data cache; translating most 8-bit sets needs only a 94-byte table.
There is nothing intrinsic to any of them that should make them
slow: the code to do them can fit into instruction caches on CPUs
(which is surely what people who want speed should be concentrating on:
what is the most functionality a standard can prescribe that still
fits into caches?). I reckon it is more an API/implementation issue.
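To make the two cheap cases concrete, here is a minimal sketch (mine, not taken from any real parser): ISO-8859-1, where transcoding really is just a widening cast, and UTF-8, where a tiny automaton over lead and continuation bytes suffices. Error handling for truncated or overlong sequences is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class Transcode {

    // ISO-8859-1: every byte value is already the Unicode code point.
    static char[] latin1ToChars(byte[] in) {
        char[] out = new char[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (char) (in[i] & 0xFF); // the "cast" in question
        }
        return out;
    }

    // UTF-8: the lead byte says how many continuation bytes follow;
    // each continuation byte contributes six more bits.
    static List<Integer> utf8ToCodePoints(byte[] in) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < in.length) {
            int b = in[i++] & 0xFF;
            int cp, extra;
            if (b < 0x80)      { cp = b;        extra = 0; } // ASCII
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; } // 2-byte form
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; } // 3-byte form
            else               { cp = b & 0x07; extra = 3; } // 4-byte form
            while (extra-- > 0) {
                cp = (cp << 6) | (in[i++] & 0x3F);
            }
            out.add(cp);
        }
        return out;
    }
}
```

The inner loop is the whole automaton: a four-way branch on the lead byte plus a shift-and-or per continuation byte, which is why it fits comfortably in cache.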
Java 1.4 NIO has completely revised its character transcoding:
you can have transcoders that autodetect, so I don't know why
someone doesn't put out an XML-autodetecting transcoder, which
would operate directly on, for example, external byte buffers. That
could give much nicer streaming performance. (Anyone have any
benchmarks for NIO b.t.w.?)
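A rough sketch of what such an XML-autodetecting transcoder might look like (the class and method names are my own invention): sniff the first two bytes along the lines of XML 1.0 Appendix F, then let an NIO CharsetDecoder chew on the ByteBuffer directly. A real one would also handle the UTF-8 BOM and read any encoding="" declaration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class XmlDecode {
    static CharBuffer decodeXml(ByteBuffer buf) throws Exception {
        buf.mark();
        int b0 = buf.hasRemaining() ? buf.get() & 0xFF : -1;
        int b1 = buf.hasRemaining() ? buf.get() & 0xFF : -1;
        buf.reset();
        String cs;
        if (b0 == 0xFE && b1 == 0xFF)      cs = "UTF-16BE"; // BOM
        else if (b0 == 0xFF && b1 == 0xFE) cs = "UTF-16LE"; // BOM
        else if (b0 == 0x00 && b1 == 0x3C) cs = "UTF-16BE"; // '<' without BOM
        else if (b0 == 0x3C && b1 == 0x00) cs = "UTF-16LE";
        else                               cs = "UTF-8";    // default per spec
        return Charset.forName(cs).newDecoder().decode(buf);
    }
}
```

Since the decoder consumes the ByteBuffer in place, the same code works on a direct (external) buffer mapped straight from a file or socket, which is where the streaming win would come from.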
The CJK sets, EBCDIC, perhaps encodings with ordering requirements such
as Thai, and older sets which need normalization are a different matter:
they are not casts, simple automata nor little tables. But removing these
from XML would not give users any extra capability: if you need speed,
send easy data.*
* For example, I found that IBM's ICU4J normalization class was way too
slow when presented with ASCII data, but it was a trivial matter to bypass.
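The bypass amounts to a fast path: ASCII text is already in Normalization Form C, so a quick scan for any character at or above U+0080 lets you skip the normalizer entirely. (I use the JDK's java.text.Normalizer below to keep the sketch self-contained; the remark above is about ICU4J's equivalent class.)

```java
import java.text.Normalizer;

public class FastNfc {
    static String normalizeNFC(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 0x80) {
                // Non-ASCII found: pay for real normalization.
                return Normalizer.normalize(s, Normalizer.Form.NFC);
            }
        }
        return s; // pure ASCII is already NFC: nothing to do
    }
}
```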