Mike Champion wrote:
>As best I know, the big win for truly binary XML
>serializations is in avoiding the overhead of the
>Unicode-encoded text to UCS-character translation.
>Does anyone take issue with the assertion that the
>external encoding-> Unicode text translation is
>generally a significant portion of XML parsing time?
Yes. Transcoding ASCII, ISO8859-1 or UTF-16 is just a cast;
translating UTF-8 is a tiny automaton, easily small enough to fit into
a data cache; translating most 8-bit sets needs only a 94-byte table.
There is nothing intrinsic to any of them that should make them
slow: the code to do them can fit into instruction caches on CPUs
(which is surely what people who want speed should be concentrating on:
what is the most functionality a standard can prescribe that still
fits into caches?). I reckon it is more an API/implementation issue.
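To make the two cheap cases concrete, here is a minimal sketch (mine, not taken from any real parser): ISO-8859-1, where transcoding really is just a widening cast, and UTF-8, where a tiny automaton over lead and continuation bytes suffices. Error handling for truncated or overlong sequences is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class Transcode {

    // ISO-8859-1: every byte value is already the Unicode code point.
    static char[] latin1ToChars(byte[] in) {
        char[] out = new char[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = (char) (in[i] & 0xFF); // the "cast" in question
        }
        return out;
    }

    // UTF-8: the lead byte says how many continuation bytes follow;
    // each continuation byte contributes six more bits.
    static List<Integer> utf8ToCodePoints(byte[] in) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < in.length) {
            int b = in[i++] & 0xFF;
            int cp, extra;
            if (b < 0x80)      { cp = b;        extra = 0; } // ASCII
            else if (b < 0xE0) { cp = b & 0x1F; extra = 1; } // 2-byte form
            else if (b < 0xF0) { cp = b & 0x0F; extra = 2; } // 3-byte form
            else               { cp = b & 0x07; extra = 3; } // 4-byte form
            while (extra-- > 0) {
                cp = (cp << 6) | (in[i++] & 0x3F);
            }
            out.add(cp);
        }
        return out;
    }
}
```

The inner loop is the whole automaton: a four-way branch on the lead byte plus a shift-and-or per continuation byte, which is why it fits comfortably in cache.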
Java 1.4 NIO has completely revised its character transcoding:
you can have transcoders that autodetect, so I don't know why
someone doesn't put out an XML-autodetecting transcoder, which
would operate directly on, for example, external byte buffers. That
could give much nicer streaming performance. (Anyone have any
benchmarks for NIO b.t.w.?)
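A rough sketch of what such an XML-autodetecting transcoder might look like (the class and method names are my own invention): sniff the first two bytes along the lines of XML 1.0 Appendix F, then let an NIO CharsetDecoder chew on the ByteBuffer directly. A real one would also handle the UTF-8 BOM and read any encoding="" declaration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class XmlDecode {
    static CharBuffer decodeXml(ByteBuffer buf) throws Exception {
        buf.mark();
        int b0 = buf.hasRemaining() ? buf.get() & 0xFF : -1;
        int b1 = buf.hasRemaining() ? buf.get() & 0xFF : -1;
        buf.reset();
        String cs;
        if (b0 == 0xFE && b1 == 0xFF)      cs = "UTF-16BE"; // BOM
        else if (b0 == 0xFF && b1 == 0xFE) cs = "UTF-16LE"; // BOM
        else if (b0 == 0x00 && b1 == 0x3C) cs = "UTF-16BE"; // '<' without BOM
        else if (b0 == 0x3C && b1 == 0x00) cs = "UTF-16LE";
        else                               cs = "UTF-8";    // default per spec
        return Charset.forName(cs).newDecoder().decode(buf);
    }
}
```

Since the decoder consumes the ByteBuffer in place, the same code works on a direct (external) buffer mapped straight from a file or socket, which is where the streaming win would come from.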
The CJK sets, EBCDIC, perhaps encodings with ordering requirements such
as Thai, and older sets which need normalization are a different matter:
they are not casts, simple automata nor little tables. But removing these
from XML would not give users any extra capability: if you need speed,
send easy data.*
* For example, I found that IBM's ICU4J normalization class was way too
slow when presented with ASCII data, but it was a trivial matter to bypass.
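The bypass amounts to a fast path: ASCII text is already in Normalization Form C, so a quick scan for any character at or above U+0080 lets you skip the normalizer entirely. (I use the JDK's java.text.Normalizer below to keep the sketch self-contained; the remark above is about ICU4J's equivalent class.)

```java
import java.text.Normalizer;

public class FastNfc {
    static String normalizeNFC(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 0x80) {
                // Non-ASCII found: pay for real normalization.
                return Normalizer.normalize(s, Normalizer.Form.NFC);
            }
        }
        return s; // pure ASCII is already NFC: nothing to do
    }
}
```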