At 8:33 AM -0400 9/21/03, David Megginson wrote:
>I think that James was talking about going from bytes representing a
>Unicode character encoding, not a binary encoding. There should be no
>platform dependencies in that case.
I understood that, and my point still holds. There are platform
dependencies in this case. If the native char and string types are
built on UTF-8 (Perl, maybe?) then this is straightforward.
However, when the native char and string types are based on UTF-16 a
conversion is necessary. Ditto for UTF-16BE to UTF-16LE and vice
versa. Or UTF-8/UTF-16 --> UTF-32. Languages and platforms do not
share the same internal representations of Unicode. No one binary
format will work for everyone.
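
To illustrate the point about differing representations, here is a minimal Java sketch that prints the bytes for one character in three encodings. The class name `EncodingForms` and the helper `hex` are made up for this example; only the `java.nio.charset.StandardCharsets` constants and `String.getBytes(Charset)` come from the standard library.

```java
import java.nio.charset.StandardCharsets;

public class EncodingForms {
    // Hypothetical helper: render a byte array as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+00E9 (LATIN SMALL LETTER E WITH ACUTE), held in Java's native UTF-16 string.
        String text = "\u00E9";

        // The same character yields three different byte sequences.
        System.out.println("UTF-8:    " + hex(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16BE: " + hex(text.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex(text.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

Whichever of these byte layouts a binary format picks, any platform whose native strings use a different one has to convert on every read and write.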
This conversion is non-trivial too. In the current version of XOM I
made a deliberate decision after profiling to store internal text node
data in UTF-8 rather than UTF-16. That saves me a *lot* of memory.
However, the constant conversion between the internal UTF-8
representation and Java's UTF-16 representation imposes about a 10%
speed penalty. I chose to optimize for size instead of speed in this
case, but I wouldn't suggest imposing that cost on everyone by making
all XML data UTF-8.
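
The size-versus-speed tradeoff described above can be sketched roughly as follows. This is not XOM's actual code: `Utf8Text`, `getValue`, and `storedBytes` are hypothetical names chosen for illustration. The sketch assumes a text node that keeps its data as UTF-8 bytes and decodes to a Java `String` on each access.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical text holder: compact UTF-8 storage, decoded on demand.
public class Utf8Text {
    private final byte[] data;

    public Utf8Text(String value) {
        // Encoding cost is paid once, when the node is created.
        this.data = value.getBytes(StandardCharsets.UTF_8);
    }

    public String getValue() {
        // Decoding cost is paid on every call; this is the speed penalty.
        return new String(data, StandardCharsets.UTF_8);
    }

    public int storedBytes() {
        return data.length;
    }

    public static void main(String[] args) {
        String s = "Processing XML";
        Utf8Text node = new Utf8Text(s);

        // For ASCII-only text, UTF-8 needs one byte per character,
        // while UTF-16 needs two bytes per char.
        System.out.println("UTF-8 bytes:  " + node.storedBytes());
        System.out.println("UTF-16 bytes: " + s.length() * 2);
        System.out.println("Round trip ok: " + node.getValue().equals(s));
    }
}
```

The saving shrinks for text outside the ASCII range, where UTF-8 uses two or more bytes per character, so how much memory this buys depends on the documents being processed.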
Elliotte Rusty Harold
Processing XML with Java (Addison-Wesley, 2002)