Yes, well you can see that I'm yet another victim of
On Tue, 19 Apr 2005 7:47 pm, David Carlisle wrote:
> > One point I would like to make is from the python link
> > (http://www.oreillynet.com/pub/wlg/6291) where mention
> > is made of the assumption of parsing 8-bit text documents
> > when Unicode docs may be the norm in the future.
> Unicode encodings are already the default encodings in windows and more
> recent linux distributions, so that would be now, not "in the future".
> > Unicode is (according to my understanding) an 8-bit escaping system. That
> > is if the character is extended, it is written into a second, third and
> > then consecutive bytes if required.
> No, that is (more or less) a description of UTF-8. Unicode itself has
> nothing to do with bytes or encodings, it is a mapping of a set
> of characters (with associated names and other properties) to numbers in
> the range hex 0 to 10FFFF.
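
To make the distinction concrete, here is a minimal Python 3 sketch (not
from the original thread; the sample characters and the helper name
utf8_bytes are my own choices for illustration). It prints each
character's code point, which is just a number, next to the number of
bytes UTF-8 happens to need for it.

```python
def utf8_bytes(ch: str) -> bytes:
    """Encode a single character as UTF-8 (hypothetical helper for this sketch)."""
    return ch.encode("utf-8")

# Sample characters chosen for illustration: ASCII, Latin-1 range,
# a BMP symbol, and a character outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

# Pair each code point (a number) with the length of its UTF-8 byte sequence.
table = [(hex(ord(ch)), len(utf8_bytes(ch))) for ch in samples]

for code_point, byte_count in table:
    print(code_point, byte_count)
```

The code point is the same whatever encoding is used; only the byte
count on the right depends on UTF-8.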
> > So to do really *fast* unicode stuff, ideally, the in-memory view
> > wouldn't store the characters in 8-bit, but just as 32-bit (4 byte) or
> > 64-bit (8 byte) strings.
> That would be UCS4 encoding (otherwise known as utf-32)
> utf-16 is also common, probably more so than utf-8 (java uses utf-16 by
> default as does msxml).
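
As a rough illustration of the trade-off between the three encodings
(again Python 3, and again my own sketch rather than anything from the
thread), the function below, hypothetically named encoded_sizes,
reports how many bytes the same text takes in each. The little-endian
codec names are used only so that no byte order mark is added to the
counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text under three Unicode encodings."""
    # The "-le" variants fix the byte order, so no BOM is prepended.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

# Plain ASCII text: UTF-8 is the most compact, UTF-32 the least.
sizes_ascii = encoded_sizes("xml")

# A single character outside the Basic Multilingual Plane (U+1D11E):
# UTF-16 needs a surrogate pair, so all three encodings use four bytes.
sizes_clef = encoded_sizes("\U0001D11E")

print(sizes_ascii)
print(sizes_clef)
```

Only UTF-32 uses a fixed four bytes per character; UTF-16 is two bytes
for most characters but four for anything above U+FFFF.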
Computergrid : The ones with the most connections win.